RetroArch 1.11.1 release!

September 30, 2022March 8, 2024 Libretro Team

RetroArch 1.11.1 has just been released.

Grab it here.

If you’d like to learn more about upcoming releases, please consult our roadmap here.

Remember that this project exists for the benefit of our users, and that we wouldn’t keep doing this were it not for spreading the love to our users. This project exists because of your support and belief in us to keep going doing great things. We have always prioritized the endusers experience, and unlike others, we have never emburdened them with in-app ads, monetization SDKs or paywalled features, and we intend to continue to do so. If you’d like to show your support, consider donating to us. Check here in order to learn more. In addition to being able to support us on Patreon, there is now also the option to sponsor us on Github Sponsors! You can also help us out by buying some of our merch on our Teespring store!

NOTE: The Android version on Samsung Galaxy Store, Huawei AppGallery, and Amazon App Store will be updated soon. We will remove this notice when it has been updated. Until then, grab the APK from our site.

NOTE: Several size optimizations have been made to the packages. We no longer pre-install all of the optional XMB theme packs or other miscellaneous assets. Previously we also shipped autoconfig files that were irrelevant for that specific platform. By excluding these files from the package, we have managed to reduce the filesize and overall amount of files of RetroArch downloads/installs significantly. On consoles this will be very helpful where SD card/FTP installs can tend to be very slow.

If you still want to have all assets, you can go to Online Updater and select ‘Update Assets’. This will install all assets.

Changelog

1.11.1

GENERAL: Fix DEFAULT_FILL_TITLE_MACRO
NETWORKING: Add the const qualifier to some function parameters
NETWORKING/NETPLAY/UPNP: Add a private or CGNAT address warning to UPnP
SAVESTATES/SCREENSHOTS: Avoid ‘video_gpu_screenshot’ with savestates
UWP: Better ‘Save on quit’ fix

1.11.0

3DS: Add unique ID’s
3DS: Add bottom menu options
3DS: Set bottom_asset directory default
3DS: Only enable internal counter with CONSOLE_LOG defined
3DS: Set default bottom font values
3DS: Fix CIA installation issues
3DS: Support latest libctru
ANDROID: Add HAVE_ACCESSIBILITY
ANDROID: Gingerbread support
ANDROID: Touchpads support
ANDROID: Builtin Xperia Play autoconfig profile
ANDROID: Disable Feral GameMode for Android – only available on Linux
ANDROID: Add a configurable workaround for Android reconnecting devices
ANDROID/FDROID: Add F-Droid metadata to repo in Fastlane format
AUDIO/AUDIO MIXER: Add missing locks for thread safety
AUDIO/AUDIO MIXER: Fix audio mixer memory leak + remove redundant ‘single threaded’ rthreads implementation
AUTOSAVE: Change/improve exit behavior of autosave thread – if condition variable is signaled, the loop is ran another last time so we can do a final check/save before stopping the thread.
CDROM: Fix memory leak caught with asan – buf passed to filestream_read_file
CORE INFO/NETPLAY: Ensure current core info is initialized at runloop_event_init_core when netplay is enabled
CHEEVOS: Upgrade to rcheevos 10.4
CHEEVOS: Allow creating auto savestate in hardcore
CHEEVOS: prevent invalid memory reference if game has achievements but core doesn’t expose memory
CHEEVOS: Release achievement badge textures when video driver is deinitialized
CHEEVOS: Re-enforce hardcore limitations once achievements are loaded
CHEEVOS/MENU/MATERIALUI: Show achievement badge icons in MaterialUI driver
D3D9: D3D9 has been split up into two drivers – D3D9 HLSL (max compatibility, no shader support yet) and D3D9 Cg (dependent on deprecated Nvidia Cg runtime library)
D3D9/HLSL/XMB: XMB fix
D3D9/CG: D3D9 Cg driver fixed
D3D11: Fix overlay not showing up
D3D11/12: Reduce lag with WaitForVBlank – this rather simple addition seems to make D3D11/12 very very close to Vulkan/GLCore regarding input lag.
D3D11/12: Add waitable swapchains and max frame latency option
D3D11/12: Make waitable swapchains optional
DATABASE: Reformat ‘rdb_entry_int’ – Nitpick adjustments for database entries: Capitalize “Release Date”, and remove space before : from Release Date rows which use integer
DATABASE/EXPLORE: Allow On-Demand Thumbnails in Explore menu
DATABASE/EXPLORE/MENU/OZONE/XMB/RGUI: Explore menu thumbnails
DISC CONTROL: Better Disc Control append focus
DOS/DJGPP: Add a workaround for libc bug
AUTOMATIC FRAME DELAY: Added slowmotion resiliency
AUTOMATIC FRAME DELAY: Added string representation for seeing the current effective delay without opening statistics
AUTOMATIC FRAME DELAY: Added “ms” to logging and “(ms)” to label just like in Audio Latency
GENERAL: Don’t bake in OpenAL and libcaca by default unless explicitly enabled with configure switch.
GENERAL: Reduce amount of strlen calls
GENERAL: Reduce or simply sin/cosf calls
GFX: Fix readability and precision issues in aspectratio_lut
GFX: Add option to manually enable/disable automatic refresh rate switching
GFX: Enable automatic configuration of ‘VSync Swap Interval’
GFX/FONT/FREETYPE: Use FT_New_Memory_Face – first read it from file to memory beforehand –
this solves an asset extraction issue when selecting ‘Update Assets’ – apparently FT_New_Face keeps an open file handle to the font file which
prevents it from being overwritten/deleted while the program is still running.
GFX/THUMBNAILS: Thumbnail aspect ratio fix
GFX/THREADED VIDEO: Optimizations, fixes and cleanups
GFX/VIDEO FILTERS: Add Upscale_240x160-320×240 video filter with ‘mixed’ method
GLSLANG: Fix compilation with ./configure –disable-builtinglslang – was missing linking against -lMachineIndependent and -lGenericCodeGen static libs
INPUT: Fix off by one error for input_block_timeout setting. Also default to 0 for this setting (pretty massive performance gain)
INPUT: Analog button mapping fixes
INPUT/HID/OSX: Fix DualShock3 support
INPUT/HID/LINUX: (qb) Disable HAVE_HID by default for now for Linux as long as there are no working backends for both
INPUT/HID/WINDOWS: (qb) Disable HAVE_HID by default for now for Windows as long as there are no working backends for both
INPUT/HID/WIIU: Fix DualShock3 support
INPUT/OVERLAY: Block pointer input when overlay is pressed
INPUT/REMAPPING: input_remapping_save_file – existing remapping file was needlessly reloaded
INPUT/REMAPPING: Add option to disable automatic saving of input remap files
INPUT/LINUX/UDEV: Fix lightgun scaling on Y axis
INPUT/LINUX/X11/LED: Add LED keyboard driver
INPUT/WINDOWS/LED: LED keyboard driver cleanup
INPUT/WINDOWS/WINRAW: Clear key states when unfocused
INPUT/WINDOWS/WINRAW: Fix pointer device position
IOS: iOS app icon fixes & revisions
LIBRETRO/SAVESTATES: Implement an api call for context awareness
LOCALIZATION: Updates
LOCALIZATION: Add Catalan language option
LOCALIZATION: Fix some bad localization
LINUX: Make memfd_create call more backwards compatible by calling it through syscall – on older systems, you’ll have to include linux/memfd.h for the MFD_ defines, and call memfd_create() via the the syscall(2) wrapper (and include unistd.h and sys/syscall.h for it work). We exclude linux/memfd.h header include because we already provide the MFD_ defines in case they are missing
LINUX/MALI FBDEV: Fix assertion failed on video threaded switch
MENU: Menu paging navigation adjustments
MENU: New Menu Items for disabling Info & Search buttons in the menu
MENU: Allow the user to use volume up/down/mute hotkeys from within the menu
MENU: Add missing sublabels for non-running Quick Menu
MENU: Reorganize Quick Menu Information
MENU: Savestate thumbnails – Savestate slot reset action
MENU: Allow changing savestate slots with left/right on save/load
MENU: Add ‘Ago’ to playlist last played styles
MENU: Add proper icons for shader items
MENU/MATERIALUI: Add icon for ‘Download Thumbnails’
MENU/XMB: Add options for hiding header and horizontal title margin
MENU/XMB: Dynamic wallpaper fixes
MENU/XMB: Add Daite XMB Icon Theme
MENU/XMB/OZONE: Savestate thumbnail aspect ratio
MENU/XMB/OZONE: Core option category icon refinements
MENU/XMB/OZONE: Fullscreen thumbnail browsing
MENU/XMB/OZONE: Add playlist icons under ‘Load Content’
MENU/XMB/OZONE: Thumbnail improvements
MENU/XMB/OZONE: Savestate thumbnail fullscreen + dropdown
MENU/XMB/OZONE: Prevent unnecessary thumbnail requests when scrolling through playlists
MENU/OZONE: Fix playlist thumbnail mouse hover after returning from Quick Menu
MENU/OZONE: Thumbnail visibility corrections
MENU/OZONE: Playlist metadata reformat
MENU/OZONE: Savestate thumbnail fixes
MENU/OZONE: Add savestate thumbnails
MENU/OZONE: Header icon spacing adjustment
MENU/RGUI: Savestate thumbnails
MENU/SETTINGS: Turn Advanced Settings on by default, this entire filtering of settings will need a complete rethink anyways
MENU/WIDGETS: Widget color + position adjustments
MIYOO: Exclude unused HAVE_HID for Miyoo
MIYOO: Enable screenshots
MIYOO: Enable rewind
NETWORK: Allow MITM server selection on OK callback
NETWORK: Replace socket_select calls
NETWORK: Implement binary network streams
NETWORK: Poll support
NETWORK: Check connect errno for successful connection
NETWORK: Get rid of the timeout_enable parameter for socket_connect
NETWORK: Fix getnameinfo_retro’s port value for HAVE_SOCKET_LEGACY platforms
NETWORK: Define inet_ntop and inet_pton for older Windows versions
NETWORK: Define isinprogress function
NETWORK/NATT: Move natt files to “network”
NETWORK/NETWORK STREAMS: Add function netstream_eof
NETWORK/NETPLAY: Fix game CRC parsing
NETWORK/NETPLAY: Disable and hide stateless mode
NETWORK/NETPLAY: Change default for input sharing to “no sharing”
NETWORK/NETPLAY: Enforce a timeout during connection
NETWORK/NETPLAY: Disallow clients from loading states and resetting
NETWORK/NETPLAY: Special saves directory for client
NETWORK/NETPLAY: Ensure current content is reloaded before joining a host
NETWORK/NETPLAY: Fix client info devices index
NETWORK/NETPLAY: Fix input for some cores when hosting
NETWORK/NETPLAY: Memory leak fixes
NETWORK/NETPLAY: Force a core update when starting netplay
NETWORK/NETPLAY: Fix NAT traversal announce for HAVE_SOCKET_LEGACY platforms
NETWORK/NETPLAY: Refactor fork arguments
NETWORK/NETPLAY: Fix content reload deadlocks on static core platforms
NETWORK/NETPLAY: Disallow netplay start when content is not loaded for static core platforms
NETWORK/NETPLAY: Show client slowdown information
NETWORK/NETPLAY: Improve check frames menu entry
NETWORK/NETPLAY: Do not try to receive new data if the data is in the buffer
NETWORK/NETPLAY: Copy data on receive, even if the buffer is full
NETWORK/NETPLAY: Fix lobby sublabel CRC display on some platforms
NETWORK/NETPLAY: Support for customizing chat colors
NETWORK/NETPLAY: Small launch compatibility patch adjustments
NETWORK/NETPLAY: Support for banning clients
NETWORK/NETPLAY: Minor tweaks to the find content task
NETWORK/NETPLAY: Support for gathering client info and kicking
NETWORK/NETPLAY: Fix possible deadlock
NETWORK/NETPLAY: Initialize client’s allow_pausing to true
NETWORK/NETPLAY: Disable netplay for unsupported cores – with stateless mode being disabled for now, there is no reason not to include this. Refuse to initialize netplay when the current core is not supported (no proper savestates support)
NETWORK/NETPLAY/DISCOVERY: Ensure fixed width ints on packet struct
NETWORK/NETPLAY/DISCOVERY: Support for IPv4 tunneling (6to4)
NETWORK/NETPLAY/DISCOVERY/TASKS: Netplay/LAN Discovery Task refactor – aims to prevent blocking the main thread while awaiting for the LAN discovery timeout; This is accomplished by moving the whole discovery functionality into its task and using a non-blocking timer to finish the task. Also fixes discovery sockets not being made non-blocking, which could cause the main thread to hang for very long periods of time every pre-frame.
NETWORK/NETPLAY/TASKS: Find content task refactor – fixes many issues along the way, including a couple of nasty memory leaks that would leak thousands of bytes each time the task ran. It also expands the original concept by matching currently run content by filename (CRC matching is always performed first though).
NETWORK/NETPLAY/TASKS: Find content task refactor – Ensure CRC32 is 8 characters long
NETWORK/NETPLAY/LOBBY: Add setting for filtering out rooms with non-installed cores
NETWORK/NETPLAY/LOBBY: Hide older (incompatible) rooms
NETWORK/NETPLAY/LOBBY: Add a toggleable filter for passworded rooms. In addition, move lobby filters into its own submenu for better organization.
NETWORK/NETPLAY/MENU: Chat supported info for the host kick submenu
NETWORK/NETPLAY/MENU: Localize relay servers
NETWORK/NETPLAY/MENU: Host Ban Submenu
NETWORK/NETPLAY/MENU: Add client devices info to the kick sub-menu
NETWORK/NETPLAY/MENU: Path: Netplay -> Host -> Kick Client – Allows the host to kick clients. Allows the host to view client information: connected clients (names), status (playing/spectating) and ping.
NETWORK/NETPLAY/VITA: Add net_ifinfo support
NETWORK/NETPLAY/VITA: Enable partial LAN discovery
NETWORK/NETPLAY/VITA: Change default UDP port to 19492
NETWORK/NETPLAY/VITA: Do not multiply negative timeout values
NETWORK/NETPLAY/VITA: Fix epoll’s timeout parameter
NETWORK/NETPLAY/VITA: Launch compatibility patch
NETWORK/NETPLAY/3DS: Launch compatibility patch
NETWORK/NETPLAY/3DS: Adapt POLL for 3DS platform
NETWORK/NETPLAY/PS3: Launch compatibility patch
NETWORK/NETPLAY/WII: Enable net_ifinfo for some features. In practice, this only allows the netplay’s UPnP task to succeed on the Wii.
NETWORK/NETPLAY/WIIU: Launch compatibility patch
NETWORK/NETPLAY/SWITCH: Launch compatibility patch
NETWORK/UPNP: Attempt support for remaining platforms
NETWORK/UPNP: Support for IPv4 tunneling
ODROID GO2: Increase DEFAULT_MAX_PADS to 8 for ODROIDGO2, since that impacts the RG351[X] consoles. The RG351[X] have a USB host controller and can have an arbitrary number of USB gamepads.
ONLINE UPDATER: Online Updater menu reorganizing
OSX: Fixed items of system top menu bar on macOS
OSX: Revision to macOS app icon set
PLAYLISTS: Ensure history list will contain CRC32
PLAYLISTS: Fix CRC32 comparison – as state->content_crc has “|crc” suffix.
PS4/ORBIS: Orbis/PS4 Support using OrbisDev toolchain
PS4/ORBIS: Update xxHash dependency
PS4/ORBIS: Shader cache
RETROFW: Exclude unused HAVE_HID for RetroFW
RETROFW: Support battery indicator on RetroFW
RETROFW: Enable menu toggle button on retrofw devices
SHADERS: Shader Preset Loading of Multiple additional #references lines for settings
SHADERS: Shader Load Extra Parameter Reference Files – this adds the ability to put additional #reference lines inside shader presets which will load additional settings. The first reference in the preset still needs to point at a chain of presets which ends with a shader chain, and subsequent #reference lines will load presets which only have parameter values adjustment. This allows presets to be made with a modular selection of settings. For example with the Mega Bezel one additional reference could point at a preset which contained settings for Night mode vs Day mode, and another reference could point to a preset which contained settings for how much the screen should be zoomed in.
SHADERS/MENU: Increase shader scale max value
SCANNER/DC: Fix Redump bin/cue scan for some DC games
SCANNER/GC/WII: Add RVZ/WIA scan support for GC/Wii
SCANNER/PS1: Improved success rate of Serial scanning on PS1 by adding support for the xx.xxx format
SCANNER/PS1: Changed return value of detect_ps1_game function to actually return a failure when the Serial couldn’t be extracted. Scanner will then fallback on crc check, and usually ends up finding the games in the database.
SWITCH: Enable RWAV (WAV audio file) support
STRING: Do not assume char is unsigned
TASKS: More thread-awareness in task callbacks
TASKS: Fix race condition at task_queue_wait
TVOS: Revised tvOS icons w/ updated alien.
VFS: Fix various VFS / file stream issues
VULKAN: Fix more validation errors
VULKAN: Attempt to fix validation errors with HDR swapchain. Always use final render pass type equal to swapchain format. Use more direct logic to expose if filter chain emits HDR10 color space or not
VULKAN/ANDROID: Honor SUBOPTIMAL on non-Android since you’d want to recreate swapchains then. On Android it can be promoted to SUCCESS.

SUBOPTIMAL_KHR can happen there when rotation (pre-rotate) is wrong.

VULKAN/DEBUG: Automatically mark buffer/images/memory with names
VULKAN/DEBUG: Move over to VK_EXT_debug_utils. Debug marker is deprecated years ago.
VULKAN/HDR: Fix leak of HDR UBO buffer
VULKAN/BFI: Fix BFI (Black Frame Insertion) regression
WINDOWS: Fix exclusive fullscreen video refresh rate when vsync swap interval is not equal to one – refresh rate in exclusive fullscreen mode was being incorrectly multiplied by vsync swap interval, breaking swap interval functionality at the gfx driver level
WIN32: Do optimization for Windows where we only update the title with SetWindowText when the previous title differs from the current title
WIN32: Skip console attach when logging to file
WIN32: Remove black margins with borderless non-fullscreen window
WIN32/TASKBAR: Release ITaskbarList3 on failed HrInit – pointer wasn’t NULL’d, thus set_window_progress would cause weird behavior
WII/GX: Fix potential datarace
WIIU: Implement sysconf and __clear_cache
WIIU: Add OS memory mapping imports
UWP: Added launch protocol arg ‘forceExit’ so a frontend can tell an already-running RetroArch UWP instance to quit.
UWP: Enable core downloader/updater
UWP: Remove copy permissions as its inefficient as we can just directly assign the new ACL and that works
Xbox/UWP: Remove expandedResources
Xbox/UWP: UWP OnSuspending crash fix
Xbox/UWP: Enable savestate file compression by default for UWP/Xbox – got told there are no more issues with it
Xbox/UWP: Add support for 4k to angle on xbox for MSVC2017 build

ParaLLEl-RDP – How the upscaled rendering works

June 19, 2020August 17, 2021 Maister

This is a technical article on how upscaling in LLE works on the N64 RDP. Accurate upscaling in LLE is something which has not been done before (it has been done in a HLE framework, but accurate is the key word here), due to its extremely intense performance requirements, but with paraLLEl-RDP running on the GPU with Vulkan, this is now practical, and the results are faithful to what N64 games would look like if games rendered at a very high resolution. There are no compromises on accuracy, and I believe this is a correct representation of upscaling in a “what-if” scenario. The changes required to add this were actually fairly minimal, and there aren’t really any hacks involved. However, we have to be somewhat conservative in what we attempt to enhance.

Main concepts

Unified Memory Architecture – fully accurate frame buffer behavior

A complicated problem with the N64 is that the RDP and CPU have a unified memory architecture, and this complicates a lot. We must assume that the CPU can read arbitrary pixels that the RDP rendered, and the CPU can overwrite pixels written by the RDP earlier. In upscaling, this gets weird very quickly since the CPU does not understand upscaling. To support this, the GPU renders everything twice, once in the native domain, and finally in the upscaled domain. With this approach, the CPU cannot observe that upscaling is happening. It also improves performance in synchronous mode, since we can just render native resolution before we unblock CPU, and the GPU can go on to render upscaled render passes asynchronously, which takes a longer time.

Rasterization at sub-pixel precision

The core mathematical problem to solve for upscaling is how we are going to rasterize at sub-pixel precision. This gets somewhat interesting, since the RDP is fully defined in fixed-point, and there is limited precision available. Fortunately, there are enough bits of precision that we can add extra sub-pixel precision to the rasterization equations. 8x is the theoretically maximum upscaling we can achieve without going beyond 32-bit fixed point math. 8x is complete overkill, 2x and 4x are more than enough anyways.

Instancing RDRAM

Given that we have a requirement of unified memory architecture, paraLLEl-RDP directly implements a unified memory architecture (UMA) as mentioned above where the GPU reads and writes directly into RDRAM. This ensures full accuracy, and this is usually where HLE fails, as implementing UMA at this level is not practical with the traditional graphics pipeline in GPUs. To extend paraLLEl-RDP’s approach to upscaling, I went with multiple copies of RDRAM, one copy for each sub-sample. This works really well, because at any time, if we detect that any write happens in an unscaled context, e.g. CPU writes, we can simply duplicate samples up to upscaled domain. This is essentially some kind of faux MSAA where each pixel has multiple samples associated with it. This is the memory we end up allocating for a 4x upscale (4×4 = 16 samples):

RDRAM (8 MB) – Allocated on host with VK_EXT_external_memory_host. This is fully coherent with emulated CPU.
Hidden RDRAM (4 MB) – Device local
RDRAM reference buffer (8 MB) – Device local
Multisampled RDRAM (8 * 16 MB) – Device local
Multisampled Hidden RDRAM (4 * 16 MB) – Device local

The reference buffer is there so we can track when CPU writes to RDRAM. Essentially, before we render anything on the GPU, we compare RDRAM against the reference buffer. If there is a difference, the CPU must have clobbered the pixel, and the RDRAM is now duplicated to all the samples of RDRAM. After rendering something, we update the reference buffer, so we know it’s safe to use upscaled pixels later.

When rendering an upscaled pixel (X, Y), we convert the coordinate to native pixel (X, Y) and convert the sub-pixel to an RDRAM instance, e.g.:

ivec2 upscaled_pixel = ivec2(x, y);
ivec2 subpixel = upscaled_pixel & (SCALING_FACTOR - 1);
ivec2 native_pixel = upscaled_pixel >> SCALING_LOG2;
int rdram_instance = subpixel.y * SCALING_FACTOR + subpixel.x;
read_write_rdram(native_pixel, rdram_instance);

Upscaled VI interface

Adding upscaling to the VI interface is fairly straight forward since we can convert e.g. 16 samples back to a 4×4 block of pixels. From there, we just follow the exact same algorithms that we do for native rendering. This means we get correct VI AA, divot and de-dither happening at high resolution.

Modifying rasterization rules

The RDP is a span rasterizer, a very classic design. The rasterization rules are extremely specific and cannot be accurately represented using normal OpenGL/Vulkan triangle rasterization rules, which are based on barycentric plane equations (to the best of my knowledge you can only approximate).

The RDP receives pre-computed triangle setup data from the RSP. We specify three lines with the triangle setup, where one line is the “major” line XH, and a second line is picked from the two “minor” lines XM/XL, depending on y >= YM. Two values YH and YL limit which scanlines we should render. This lets us implement triangles, or more complicated primitives if we want to. Bisqwit made a really cool ongoing video series on software rendering a while back which also implements a span rasterizer, which is very useful to watch if you want a deeper understanding of this approach.

This triangle setup data is defined more specifically as:

XH, XM, XL: 32-bit values in the format of s12.15.x. The 4 MSB are sign-extended, and the single LSB is ignored (we can exploit this bit for more precision later!)
dXHdy, dXMdy, dXLdy: 32-bit values in the format of s12.13.xxx. 4 MSBs are sign-extended, and 3 LSBs are ignored. This represents the slope of the line for XH, XM and XL.
YH: This is a s12.2 value which represents the first scanline we render. There is 2 bits of subpixel precision, which is very useful because the RDP will sample coverage for 4 sub-scanlines per scanline.
YM: This s12.2 value represents the first sub-scanline where XL is selected as the minor line, otherwise XM is used.
YL: This represents the final sub-scanline which is rendered. The sub-scanline of YL is not included in rasterization.

The algorithm for native resolution in GLSL:

// Interpolate X at all 4 Y-subpixels.

// Check Y dimension.
int yh_interpolation_base = int(setup.yh) & ~(SUBPIXELS - 1);
int ym_interpolation_base = int(setup.ym);

int y_sub = int(y * SUBPIXELS);
ivec4 y_subs = y_sub + ivec4(0, 1, 2, 3);

// dxhdy and others are (setup value >> 2) since we're stepping one sub-scanline at a time, not whole lines. This is why more LSBs are ignored for the slopes.
ivec4 xh = setup.xh + (y_subs - yh_interpolation_base) * setup.dxhdy;
ivec4 xm = setup.xm + (y_subs - yh_interpolation_base) * setup.dxmdy;
ivec4 xl = setup.xl + (y_subs - ym_interpolation_base) * setup.dxldy;
xl = mix(xl, xm, lessThan(y_subs, ivec4(setup.ym)));

ivec4 xh_shifted = quantize_x(xh); // A very specific quantizer, see source ...
ivec4 xl_shifted = quantize_x(xl);

ivec4 xleft, xright;
if (flip) // Flip is a bit set in triangle setup to mark primitive winding.
{
    xleft = xh_shifted;
    xright = xl_shifted;
}
else
{
    xleft = xl_shifted;
    xright = xh_shifted;
}

We have now computed a range of which pixels to render for each sub-scanline, where [xleft, xright) is the range. If xright <= xleft, the sub-scanline does not receive coverage. The quantizer is somewhat esoteric, but we essentially quantize X down to 8 sub-pixels of precision (>> 13). This is used later for multi-sampled coverage in the X dimension.

To add upscaling, the modifications are straight forward.

int yh_interpolation_base = int(setup.yh) & ~(SUBPIXELS - 1);
int ym_interpolation_base = int(setup.ym);
yh_interpolation_base *= SCALING_FACTOR;
ym_interpolation_base *= SCALING_FACTOR;

int y_sub = int(y * SUBPIXELS);
ivec4 y_subs = y_sub + ivec4(0, 1, 2, 3);

// Interpolate X at all 4 Y-subpixels.
ivec4 xh = setup.xh * SCALING_FACTOR + (y_subs - yh_interpolation_base) * setup.dxhdy;
ivec4 xm = setup.xm * SCALING_FACTOR + (y_subs - yh_interpolation_base) * setup.dxmdy;
ivec4 xl = setup.xl * SCALING_FACTOR + (y_subs - ym_interpolation_base) * setup.dxldy;
xl = mix(xl, xm, lessThan(y_subs, ivec4(SCALING_FACTOR * setup.ym)));

This is an accurate representation, as the only thing we do here is to shift in more bits into triangle setup, as long as this does not overflow, we’re golden. After this step, we have scissoring. Scissor coordinates are u10.2 fixed point, so it means the maximum resolution for the RDP is 1024×1024. With 8x upscale and 8 sub-pixels of X precision, we can barely pack the resulting range in unsigned 16-bits without overflow.

Modifying varying interpolation

Attribute interpolation is a little more interesting. There are 8 varyings, which all have the same setup data:

Shade Red/Green/Blue/Alpha
S
T
1/W
Z

Each varying has 4 values:

Base value – sampled at coordinate (XH, YH) (kinda … it’s complicated)
dVdx – Change in value for 1 pixel in X dimension
dVde – Change in value when following the major axis down one line, and sampling at the next line’s XH. Basically dVde = dVdx * dXdy + dVdy. I’m not sure why this even exists, it makes the interpolation math a little easier I suppose?
dVdy – This feels very redundant, but it is what it is. It is only used for coverage fixup and LOD computation.

We cannot shift in extra bits here, unlike rasterization, so we have to be a little creative here. To stay faithful, and avoid overflow, we need to ensure that the interpolation is correct for each sample point which matches sample points for native resolution, and for the inner sub-pixels, we remove some bits of precision in the derivative. Essentially, instead of doing something like this (not the correct math, see code, here for brevity):

int base_interpolated_x = ((setup.xh + (y - base_y) * setup.dxhdy)) >> 16;
rgba = attr.rgba;
int dy = y - base_y;
int dx = x - base_interpolated_x;
rgba += dy * attr.drgba_de;
rgba += dx * attr.drgba_dx;

we do …

int base_interpolated_x = ((setup.xh + (y - base_y) * setup.dxhdy)) >> 16;
rgba = attr.rgba;
int dy = y - base_y;
int dx = x - base_interpolated_x;
rgba += (dy >> SCALING_LOG2) * attr.drgba_de + (dy & (SCALING_FACTOR - 1)) * (attr.drgba_de >> SCALING_LOG2);
rgba += (dx >> SCALING_LOG2) * attr.drgba_dx + (dx & (SCALING_FACTOR - 1)) * (attr.drgba_dx >> SCALING_LOG2);

The added error here is microscopic.

Workarounds

Some games do not work correctly when we upscale, since the game never intended to render sub-pixels. This usually comes into play in two major scenarios, which we need to workaround.

Using LOD for clever hackery

The mip-mapping on N64 is quite flexible, and sometimes two entirely different textures represent LOD 0 and LOD 1 for smooth distance based effects. When upscaling with e.g. 4x, we essentially get a LOD factor which is a LOD bias of -2 (log2(1/4)). An optional workaround is to compensate by applying a positive LOD bias ourselves to emit LOD levels the game expects. Ideally, this workaround is applied only in places where it’s needed.

Sprite rendering / TEX_RECT

Many games render sprites with TEX_RECT with the expectation that textures are rendered 1:1 with input texels to output texels. When we start upscaling, the game might have forgot to disable bilinear filtering, and we start filtering outside the texture boundaries, i.e., against garbage, which shows up as ugly seams in the image. The simple workaround is to render TEX_RECT primitives as if they are not upscaled. This is necessary anyways for the COPY pipe, since the COPY pipe only updates the varying interpolator every 8th framebuffer byte. We cannot safely upscale these kinds of primitives either way.

Conclusion

There isn’t much more to it. Adding upscaling to ParaLLEl-RDP was not all that complicated compared to the other insanity that went into making this renderer work. It’s a principled approach to the upscaling which I believe could theoretically work in a custom RDP hardware design.

paraLLEl N64 RDP – Android support and Intel iGPU improvements – What you should know (and what to expect)

June 2, 2020August 17, 2021 Libretro Team

Ridge Racer 64 running on Parallel RDP on an Android phone (with RetroArch)

Themaister wrote an article a few days ago talking in-depth about all the work that has gone into ParaLLEl RDP since launch.

Two of the important things discussed in this article were:
* Intel iGPU performance
* Android support

What you might not have realized from reading the article is that with the right tweaks, you can already get ParaLLEl RDP to run reasonably well. As indicated in the article he wrote, Themaister will be looking at WSI Vulkan issues specifically related to RetroArch since there definitely do seem to be some issues that have to be resolved. In the meantime, we have to resort to some workarounds. Workarounds or not, they will do the job for now.

How to install and set it up

In RetroArch, go to Online Updater.
(If you have paraLLEl N64 already installed) – Select ‘Update Installed Cores’. This will update all the cores that you already installed.
(If you don’t have paraLLEl N64 installed already) – go to ‘Core Updater’, and select ‘Nintendo – Nintendo 64 (paraLLEl N64)’.
Now start up a game with this core.
Go to the Quick Menu and go to ‘Options’. Scroll down the list until you reach ‘GFX Plugin’. Set this to ‘parallel’. Set ‘RSP plugin’ to ‘parallel’ as well.
For the changes to take effect, we now need to restart the core. You can either close the game or quit RetroArch and start the game up again.

Intel iGPU

What you should do for optimum performance right now:

For Intel iGPU, I have found that what makes the biggest difference by far (on Windows 10 at least) is to run it in windowed mode instead of fullscreen. Fullscreen mode will have horribly crippled performance by comparison.

Performance

Once you have done this, the performance will actually not be that far behind with a run-off-the-mill iGPU from say a 2080 Ti (in asynchronous mode). Sure, it’s still a bit slower by about ~30fps, but it’s no longer the massive gulf in performance it was before where even Angrylion was beating ParaLLEl RDP in the performance department.

With synchronous, the difference between say a 2080 Ti and an iGPU should be a bit more pronounced.

Hopefully in future RetroArch versions, it will no longer be necessary to have to resort to windowed mode for good performance with Intel iGPUs. For now, this workaround will do.

Android

What you should do for optimum performance right now:

Turn vsync off. Go to Settings -> Video -> Synchronization, and make sure that ‘Vertical Sync (Vsync)’ is disabled.

NOTE: It is imperative that you turn V-Sync off for now. If not, performance will be so badly crippled that even Angrylion will be faster by comparison. Fortunately, there will be no noticeable screen tearing even with Vsync disabled right now.

Performance

I tested ParaLLEl RDP on two devices:

Nvidia Shield TV (2015)
Samsung Galaxy S10 Plus (2019) [European Exynos model]

NOTE: The European model of the Galaxy S10 Plus used here has the Samsung Exynos SoC (System-On-A-Chip). Generally these perform worse than the US models of the Galaxy phones, which use a Qualcomm Snapdragon SoC instead. You should therefore expect significantly better performance on a US model.

Performance on Shield TV

Here are some rough performance figures for the Nvidia Shield TV –

Title	Performance
Mortal Kombat Trilogy	87 to 94fps
Yoshi’s Story	99fps
Doom 64	90 to 117fps
Tetris 64	117fps
Starcraft 64	177fps

It’s hard to put an exact number on other games, but just from a solely gameplay-focused perspective, you can get a near-locked framerate with games like Legend of Zelda: Ocarina of Time and Super Mario 64 if you run the PAL versions (which limit the framerate to 50fps instead of 60fps with NTSC versions). There might still be the odd frame drop in certain graphics intensive scenes but nothing too serious.

Similarly, games like 1080 Snowboarding drop below fullspeed with the NTSC version, but running them with the PAL version is nearly a locked framerate in all but the most intensive scenes.

Performance on Samsung Galaxy S10 Plus

Performance on a high-end 2019 phone like the Galaxy S10 Plus can tend to be more variable, probably because of the aggressive dynamic throttling being done on phones. Sometimes performance would be a significant step above the Shield TV where it could run NTSC versions of games like Legend of Zelda: Ocarina of Time and Super Mario 64 at fullspeed with no problem (save for the very odd frame drop here and there in very rare scenes), and then at other times it would perform similarly to a Shield TV. Your mileage may vary there.

Conclusions

Overall, it’s clear that certain battles have to be won on the Vulkan side, especially when it comes down to having to disable vsync at all so far for acceptable performance.

We’d like to learn more from people who have a Samsung Galaxy S20 or a similar high end phone released in 2020. Even a Snapdragon version of the S10 Plus would produce better results than what we see here.

So, Low-Level N64 emulation, is it attainable on Android? Yes, with the proper Vulkan extensions, and provided you have a reasonably modern and fast high end phone. The Shield TV is also a decent mid-range performer considering its age. Far from every game runs at fullspeed yet but the potential is certainly there for this to be a real alternative to HLE based N64 emulation on Android as hardware grows more powerful over the years.

FAQ

Some specific issues should be addressed –

Game compatibility is significantly lower on Android right now

The mupen64plus-core part of ParaLLEl N64 is older than the one found in Mupen64plus next. While on PC this is not so much of an issue because of the generally mature (but slower) Hacktarux dynarec, on ARM platforms it is a different story since new_dynarec was in a premature state back then. Not only that, LLE RDP + RSP plugin compatibility with new_dynarec was not even a consideration back then. So some games might not work at all right now with Parallel RDP+RSP on Android.

ParaLLEl N64 will likely receive a mupen64plus-core update soon, and Mupen64Plus Next might also in the near future get ParaLLEl RDP + ParaLLEl RSP support. So this situation will sort itself out.

You get a display error showing ‘ERR’ on your Android device

The Vulkan driver for your GPU is likely missing these two Vulkan extensions, which ParaLLEl RDP requires.

VK_KHR_8bit_storage
VK_KHR_16bit_storage

(Intel iGPU) Performance is halved (or more) in fullscreen mode

Known issue, read above. These issues have been identified and it’s a matter of finding the appropriate solution for these issues.

Reviving and rewriting paraLLEl-RDP – Fast and accurate low-level N64 RDP emulation

May 13, 2020August 17, 2021 Maister

Over the last few months after completing the paraLLEl-RSP rewrite to a Lightrec based recompiler, I’ve been plugging away on a project which I had been putting off for years, to implement the N64 RDP with Vulkan compute shaders in a low-level fashion. Every design of the old implementation has been scrapped, and a new implementation has arisen from the ashes. I’ve learned a lot of advanced compute techniques, and I’m able to use far better methods than I was ever able to use back in the early days. This time, I wanted to do it right. Writing a good, accurate software renderer on a massively parallel architecture is not easy and you need to rethink everything. Serial C code will get you nowhere on a GPU, but it’s a fun puzzle, and quite rewarding when stuff works.

The new implementation is a standalone repository that could be integrated into any emulator given the effort: https://github.com/Themaister/parallel-rdp. For this first release, I integrated it into parallel-n64. It is licensed as MIT, so feel free to integrate it in other emulators as well.

Why?

I wanted to prove to myself that I could, and it’s … a little fun? I won’t claim this is more than it is. 🙂

Chasing bit-exactness

The new implementation is implemented in a test-driven way. The Angrylion renderer is used as a reference, and the goal is to generate the exact same output in the new renderer. I started writing an RDP conformance suite. Here, we generate RDP commands in C++, run the commands across different implementations, and compare results in RDRAM (and hidden RDRAM of course, 9-bit RAM is no joke). To pass, we must get an exact match. This is all fixed-point arithmetic, no room for error! I’ve basically just been studying Angrylion to understand what on earth is supposed to happen, and trying to make sense of what the higher level goal of everything is. In LLE, there’s a lot of weird magic that just happens to work out.

I’m quite happy with where I’ve ended up with testing and seeing output like this gives me a small dopamine shot before committing:

…

122/163 Test #122: rdp-test-interpolation-color-texture-ci4-tlut-ia16 ………………………….. Passed 2.50 sec
Start 123: rdp-test-interpolation-color-texture-ci8-tlut-ia16
123/163 Test #123: rdp-test-interpolation-color-texture-ci8-tlut-ia16 ………………………….. Passed 2.40 sec
Start 124: rdp-test-interpolation-color-texture-ci16-tlut-ia16
124/163 Test #124: rdp-test-interpolation-color-texture-ci16-tlut-ia16 …………………………. Passed 2.37 sec
Start 125: rdp-test-interpolation-color-texture-ci32-tlut-ia16
125/163 Test #125: rdp-test-interpolation-color-texture-ci32-tlut-ia16 …………………………. Passed 2.45 sec
Start 126: rdp-test-interpolation-color-texture-2cycle-lod-frac
126/163 Test #126: rdp-test-interpolation-color-texture-2cycle-lod-frac ………………………… Passed 2.51 sec
Start 127: rdp-test-interpolation-color-texture-perspective
127/163 Test #127: rdp-test-interpolation-color-texture-perspective ……………………………. Passed 2.50 sec
Start 128: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac
128/163 Test #128: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac ……………… Passed 3.29 sec
Start 129: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen
129/163 Test #129: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen ………. Passed 3.26 sec
Start 130: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-detail
130/163 Test #130: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-detail ……….. Passed 3.48 sec
Start 131: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen-detail
131/163 Test #131: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen-detail … Passed 3.26 sec
Start 132: rdp-test-texture-load-tile-16-yuv

…

151/163 Test #151: vi-test-aa-none …………………………………………………………. Passed 21.19 sec
Start 152: vi-test-aa-extra-dither-filter
152/163 Test #152: vi-test-aa-extra-dither-filter ……………………………………………. Passed 48.77 sec
Start 153: vi-test-aa-extra-divot
153/163 Test #153: vi-test-aa-extra-divot …………………………………………………… Passed 64.29 sec
Start 154: vi-test-aa-extra-dither-filter-divot
154/163 Test #154: vi-test-aa-extra-dither-filter-divot ………………………………………. Passed 65.90 sec
Start 155: vi-test-aa-extra-gamma
155/163 Test #155: vi-test-aa-extra-gamma …………………………………………………… Passed 48.28 sec
Start 156: vi-test-aa-extra-gamma-dither
156/163 Test #156: vi-test-aa-extra-gamma-dither …………………………………………….. Passed 48.18 sec
Start 157: vi-test-aa-extra-nogamma-dither
157/163 Test #157: vi-test-aa-extra-nogamma-dither …………………………………………… Passed 47.56 sec

…

100% tests passed, 0 tests failed out of 163 #feelsgoodman

Ideally, if someone is clever enough to hook up a serial connection to the N64, it might be possible to run these tests through a real N64, that would be interesting.

I also fully implemented the VI this time around. It passes bit-exact output with Angrylion in my tests and there is a VI conformance suite to validate this as well. I implemented almost the entire thing without even running actual content. Once I got to test real content and sort out the last weird bugs, we get to the next important part of a test-driven development workflow …

The importance of dumping formats

A critical aspect of verifying behavior is being able to dump RDP commands from the emulator and replay them.

On the left I have Angrylion and on the right paraLLEl-RDP running side by side from a dump where I can step draw by draw, and drill down any pesky bugs quite effectively. This humble tool has been invaluable. The Angrylion backend in parallel-n64 can be configured to generate dumps which are then used to drill down rendering bugs offline.

Compatibility

The compatibility is much improved and should be quite high, I won’t claim its perfect, but I’m quite happy with it so far. We went through essentially all relevant titles during testing (just the first few minutes), and found and fixed the few issues which popped up. Many games which were completely broken in the old implementation now work just fine. I’m fairly confident that those bugs are solvable this time around though if/when they show up.

Implementation techniques

With Vulkan in 2020 I have some more tools in my belt than was available back in the day. Vulkan is a quite capable compute API now.

Enforcing RDRAM coherency

A major pain point of any N64 emulator is the fact that RDRAM is shared for the CPU and RDP, and games sure know how to take advantage of this. This creates a huge burden on GPU-accelerated implementations as we now have to ensure full coherency to make it accurate. Most HLE emulators simply don’t care or employ complicated heuristics and workarounds, and that’s fine, but it’s not good enough for LLE.

In the previous implementation, it would try to do “framebuffer manager” techniques similar to HLE emulators, but this was the wrong approach and lead to a design which was impossible to fix. What if … we just import RDRAM as buffer straight into the Vulkan driver and render to that, wouldn’t that be awesome? Yes … yes, it would be, and that’s what I did. We have an obscure, but amazing extension in Vulkan called VK_EXT_external_memory_host which lets me import RDRAM from the emulator straight into Vulkan and render to it over the PCI-e bus. That way, all framebuffer management woes simply disappear, I render straight into RDRAM, and the only thing left to do is to handle synchronization. If you’re worried about rendering over the PCI-e bus, then don’t be. The bandwidth required to write out a 320×240 framebuffer is absolutely trivial especially considering that we’re doing …

Tile-based rendering

The last implementation was tile-based as well, but the design is much improved. This time around all tile binning is done entirely on the GPU in parallel, using techniques I implemented in https://github.com/Themaister/RetroWarp, which was the precursor project for this new paraLLEl-RDP. Using tile-based rendering, it does not really matter that we’re effectively rendering over the PCI-e bus as tile-based rendering is extremely good at minimizing external memory bandwidth. Of course, for iGPU, there is no (?) external PCI-e bus to fight with to begin with, so that’s nice!

Ubershaders with asynchronous pipeline optimization

The entire renderer is split into a very small selection of Vulkan GLSL shaders which are precompiled into SPIR-V. This time, I take full advantage of Vulkan specialization constants which allow me to fine-tune the shader for specific RDP state. This turned out to be an absolute massive win for performance. To avoid the dreaded shader compilation stutter, I can always fallback to a generic ubershader while pipeline is being compiled which is slow, but works for any combination of state. This is a very similar idea to what Dolphin pioneered for emulation a few years ago.

8/16-bit integer support

Memory accesses in the RDP are often 8 or 16 bits, and thus it is absolutely critical that we make use of 8/16-bit storage features to interact directly with RDRAM, and if the GPU supports it, we can make use of 8 and 16-bit arithmetic as well for good measure.

Async compute

Async compute is critical as well, since we can make the async compute queue high priority and ensure that RDP shading work happens with very low latency, while VI filtering and frontend shaders can happily chug along in the fragment/graphics queue. Both AMD and NVIDIA now have competent implementations here.

GPU-driven TMEM management

A big mistake I made previously was doing TMEM management in CPU timeline, this all came crashing down once we needed framebuffer effects. To avoid this, all TMEM uploads are now driven by the GPU. This is probably the hairiest part of paraLLEl-RDP by far, but I have quite a lot of gnarly tests to test all the relevant corner cases. There are some true insane edge cases that I cannot handle yet, but the results created would be completely meaningless to any actual content.

Performance

To talk about FPS figures it’s important to consider the three major performance hogs in a low-level N64 emulator, the VR4300 CPU, the RSP and finally the RDP. Emulating the RSP in an LLE fashion is still somewhat taxing, even with a dynarec (paraLLEl-RSP) and even if I make the RDP infinitely fast, there is an upper bound to how fast we can make the emulator run as the CPU and RSP are still completely single threaded affairs. Do keep that in mind. Still, even with multithreaded Angrylion, the RDP represents a quite healthy chunk of overhead that we can almost entirely remove with a GPU implementation.

GPU bound performance

It’s useful to look at what performance we’re getting if emulation was no constraint at all. By adding PARALLEL_RDP_BENCH=1 to environment variables, I can look at how much time is spent on GPU rendering.

Playing on an GTX 166o Ti outside the castle in Mario 64:

[INFO]: Timestamp tag report: render-pass
[INFO]: 0.196 ms / frame context
[INFO]: 0.500 iterations / frame context

We’re talking ~0.2ms on GPU to render one frame on average, hello theoretical 5000 VI/s … Somewhat smaller frame times can be observed on my Radeon 5700 XT, but we’re getting frame rates so ridiciously high they become meaningless here. We’ve tested it on quite old cards as well and the difference in FPS on even something ancient like an R9 290x card and a 2080 Ti is minimal since the time now spent in RDP rendering is completely irrelevant compared to CPU + RSP workloads. We seem to be getting about a 50-100% uplift in FPS, which represents the shaved away overhead that the CPU renderer had. Hello 300+ VI/s!

Unfortunately, Intel iGPU does not fare as well, with an overhead high enough that it does not generally beat multithreaded Angrylion running on CPU. I was somewhat disappointed by this, but I have not gone into any real shader optimization work. My early analysis suggests extremely poor occupancy and a ton of register spilling. I want to create a benchmark tool at some point to help drill down these issues down the line.

It would be interesting to test on the AMD APUs, but none of us have the hardware handy sadly 🙁

Synchronous vs Asynchronous RDP

There are two modes for the RDP. In async mode, the emulation thread does not wait for the GPU to complete rendering. This improves performance, at the cost of accuracy. Many games unfortunately really rely on the unified memory architecture of the N64. The default option is sync, and should be used unless you have a real need for speed, or the game in question does not need sync.

Here we see an example of broken blob shadows caused by async RDP in Jet Force Gemini. This happens because the CPU is actually reading the shadowmap rendered by the RDP, and blurring it on the CPU timeline (why on earth the game would do that is another question), then reuploading it to the RDP. These kinds of effects require very tight sync between CPU and GPU and comes up in many games. N64 is particularly notorious for these kinds of rendering challenges.

Of course, given how fast the GPU implementation is on discrete GPUs, sync mode does not really pose an issue. Do note that since we’re using async compute queues here, we are not stalling on frontend shading or anything like that. The typical stall times on the CPU is in the order of 1 ms per frame, which is very acceptable. That includes the render thread doing its thing, submitting that to GPU, getting it executed and coming back to CPU, which has some extra overhead.

Road-map for future improvement

I believe this is solid enough for a first release, but there are further avenues for improvement.

Figure out poor performance on Intel iGPU

There is something going on here that we should be able to improve.

Implement a workaround for implementations without VK_EXT_external_memory_host (EDIT: Now implemented as of 2020-05-18)

Unfortunately there is one particular driver on desktop which doesn’t support this, and that’s NVIDIA on Linux (Windows has been supported since 2018 …). Hopefully this gets implemented soon, but we will need a fallback. This will get ugly since we’ll need to start shuffling memory back and forth between RDRAM and a GPU buffer. Hopefully the async transfer queue can help make this less painful. It might also open up some opportunities for mobile, which also don’t implement this extension as we speak. There might also be incentives to rewrite some fundamental assumptions in the N64 emulator plugin specifications (can we please get rid of this crap …). If we can let the GPU backend allocate memory, we don’t need any fancy extension, but that means uprooting 20 years of assumptions and poking into the abyss … Perhaps a new implementation can break new ground here (hi @ares_emu!).

EDIT: This is now done! Takes a 5-10% performance hit in sync mode, but the workaround works quite well. A fine blend of masked SIMD moves, a writemask buffer, and atomics …

Internal upscaling?

It is rather counter-intuitive to do upscaling in an LLE emulator, but it might yield some very interesting results. Given how obscenely fast the discrete GPUs are at this task, we should be able to do a 2x or maybe even 4x upscale at way-faster-than-realtime speeds. It would be interesting to explore if this lets us avoid the worst artifacts commonly associated with upscaling in HLE.

Fancier deinterlacer?

Some N64 content runs at 480i, and we can probably spare some GPU cycles running a fancier deinterlacer 😉

Esoteric use cases?

PS1 wobbly polygon rendering has seen some kind of resurgence in the last years in the indie scene, perhaps we’ll see the same for the fuzzy N64 look eventually. With paraLLEl-RDP, it should be possible to build a rendering engine around a N64-lookalike game. That would be cool to see.

Conclusion

This is a somewhat esoteric implementation, but I hope I’ve inspired more implementations like this. Compute-based accurate renderers will hopefully spread to more systems that have difficulties with accurate rendering. I think it’s a very interesting topic, and it’s a fun take on emulation that is not well explored in general.

paraLLEl-RDP rewritten from scratch – available in paraLLEl n64 right now for RetroArch

May 13, 2020August 17, 2021 Libretro Team

The ParaLLEl N64 Libretro core has received an update today that adds the brand new paraLLEl-RDP Vulkan renderer to the emulator core.

I implore everybody to read Themaister’s blog post (Reviving and rewriting paraLLEl-RDP – Fast and accurate low-level N64 RDP emulation) for a deep dive into this new renderer.

Requirements

You need a graphics card that supports the Vulkan graphics API.
It’s currently only available on Windows and Linux.
Right now the renderer requires a specific Vulkan extension, called ‘VK_EXT_external_memory_host’. Only Nvidia Linux binary drivers for Vulkan currently doesn’t support this extension. It has been requested but there is no ETA yet on when they will implement this.

What’s new since the old ParaLLEl RDP?

Completely rewritten from the ground up
Bit-exact renderer
Should be pretty much on par with Angrylion accuracy-wise now – none of the issues that plagued the old paraLLEl RDP
Now emulates the VI (Video Interface) as well
Basic deinterlacing for interlaced video modes

How to install and set it up

In RetroArch, go to Online Updater.
(If you have paraLLEl N64 already installed) – Select ‘Update Installed Cores’. This will update all the cores that you already installed.
(If you don’t have paraLLEl N64 installed already) – go to ‘Core Updater’, and select ‘Nintendo – Nintendo 64 (paraLLEl N64)’.
Now start up a game with this core.
Go to the Quick Menu and go to ‘Options’. Scroll down the list until you reach ‘GFX Plugin’. Set this to ‘parallel’. Set ‘RSP plugin’ to ‘parallel’ as well.
For the changes to take effect, we now need to restart the core. You can either close the game or quit RetroArch and start the game up again.

Progress and development in N64 emulation over the past decade

State of HLE emulation

IMHO, this release today represents one of the biggest steps that have been taken so far to elevate Nintendo 64 emulation as a whole. N64 emulation has gotten a bad rep for over decades because of HLE RDP renderers that fail to accurately reproduce every game’s graphics correctly and tons of unemulated RSP microcode, but it’s gotten significantly better over the years. On the HLE front, things have progressed. GLideN64 has made big strides in emulating most of the major significant games, the HLE RSP implementation used by Mupen 64 Plus is starting to emulate most of the major micro codes that developers made for N64 games. So on that front, things have certainly improved. There are also obviously limiting factors on the HLE front. For instance, GLideN64 still requires OpenGL, and renderers for Vulkan and other modern graphics APIs have not been implemented as of this date (although they could be).

State of LLE emulation

So that’s the HLE front. But for the purpose of this blog article, we are mostly concerned here about Low-Level Emulation. Both HLE and LLE N64 emulation are valid approaches, but if we want to reproduce the N64 accurately, we ultimately have to go LLE. So, what is the state of LLE emulation?

For LLE emulation, some of the advancements over the past few years has been a multithreaded version of Angrylion. Angrylion is the most accurate software RDP renderer to date. Its main problem has always been how slow it is. Up until say the mid to late ’10s, desktop PCs just did not have the CPU power to run any game at fullspeed with this renderer. Multithreaded Angrylion has seen Angrylion make some big gains in the performance department previously thought unimaginable.

However, Angrylion as a software renderer can only be taken so far. The fact remains that it is a big bottleneck on the CPU, and you can easily see CPU activity exceeding over 65% on a modern rig with the multithreaded Angrylion renderer. Software rendering is just never going to be a particularly fast way of doing 3D rasterization.

So, back in 2016, the first attempt at making a hardware renderer that can compete with Angrylion was made. It was a big release for us and it marked one of the first pieces of software to be released that was designed exclusively around the then-new Vulkan graphics API. You can read our old blog post here.

It was a valiant first attempt at making a speedy Angrylion port to hardware. Unfortunately, this first version was full of bugs, and it had some big architectural issues that just made further development on it very hard. So it didn’t see much further development for the past few years.

This year, all the stars have aligned. First out of the gates was the resurrection of paraLLEl-RSP, another project by Themaister. Low-level N64 emulation places a big demand on the CPU, and while cxd4’s RSP interpreter is very accurate, to get at least a 2x leap in performance, a dynamic recompiler approach has to be taken. To that end, this year not only was paraLLEl-RSP resurrected, but we moved the dynamic recompiler architecture from LLVM to Lightrec. It’s a bit less performant than LLVM to be sure but it also has some big advantages – LLVM runtime libraries are very hard to embed and integrate for various platforms, while Lightrec doesn’t have these dependency issues. Furthermore, LLVM would take a long time recompiling code blocks, and it would cause big stutters during gameplay (for instance, bringing up the map in Doom 64 for the first time would cause like a 5-second freeze in the gameplay while it was recompiling a code block – obviously not ideal). With Lightrec, all those stutters were more or less gone.

So, Q1 2020. We now have multithreaded Angrylion which leverages the multi-core CPUs of today’s hardware to get better performance results. We have ParaLLEl RSP, a low-level RSP plugin with a dynamic recompiler that gives us a big bump in performance. But one piece of the puzzle is still missing, and it’s perhaps the most significant. Multithreaded Angrylion still is a software renderer and therefore it still massively bottlenecks the CPU. Whether you can spread that load out over multiple cores or not ultimately matters little – CPUs just are not good at doing fast 3D rasterization, a lesson learned by nearly every mid ’90s PC game developer, and why 3D accelerated hardware could not have come sooner.

So, the obvious Next Big Thing in N64 emulation was to get rid of this CPU bottleneck and move Angrylion kicking and screaming to the GPU, and this time avoid all of the issues that plagued the initial paraLLEl RDP prototype.

Where does that leave us?

With a very accurate Angrylion-quality LLE RDP renderer running on the GPU, and a dynarec LLE RSP core, you will be surprised at how accurate Mupen 64 Plus is now. Nearly every commercial game runs now as expected with nearly no graphical issues, the sound is as you’d expect it to be, it looks, runs and functions just like a real N64. And if you’re on a discrete Nvidia or AMD GPU, your GPU activity will be 4% on average, whether it’s a stone-age GPU from the year 2013 like an AMD R9 290x, or an Nvidia Geforce 2080 Ti. Nearly any discrete GPU made from 2013 to 2020 that supports the Vulkan API seems to eat low-level N64 graphics for breakfast. CPU activity also has decreased significantly. With multithreaded Angrylion and Parallel RSP, there would be about 68% CPU activity on my rig. This is brought down to just 7 to 10% using paraLLEl RDP instead of Angrylion. Software rendering on the CPU is just a huge bottleneck no matter which way you slice it.

So for most practical purposes, using the paraLLEl RDP and paraLLEl RSP cores in tandem, the future is now. Accurate N64 emulation is here, it’s no longer slow, and it’s no longer completely CPU bound either. And you can play it on RetroArch right now, right today. We don’t have to wait for a near-accurate representation of an N64, it’s already here with us for all practical gameplay purposes.

How much faster is paraLLEl RDP compared to Angrylion? That is hard to say, and depends on the game you’re running. On average you can expect a 2x speedup. However, notice that at native resolution rendering, any discrete GPU since 2013 eats this workload for breakfast. This means you’re completely CPU bound in terms of performance most of the time. The better your CPU is at single threaded workloads (IPC), the better it will perform. Core count is a less significant factor. I think on my specific rig, it was my CPU that was the weakest link in the chain (a 7700k i7 Intel CPU paired with a 2080 Ti). The GPU matters relatively little, the 2080 Ti was mostly being completely idle during these tests. For that matter, so was an old 2013 AMD card that I would test with the same CPU – GPU activity remained flat at around 4%. As Themaister has indicated in his blog post, this leaves so much room for upscaled resolutions, which is on the roadmap for future versions.

Benchmarks

System specs: CPU – Intel Core i7 7700k | GPU – Geforce RTX 2080 Ti (11GB VRAM, 2018) | 16GB RAM

Title	Angrylion	ParaLLEl RDP (Synchronous)	ParaLLEl RDP (Asynchronous)
007 GoldenEye	82fps	119fps	133fps
Banjo Tooie	72fps	132fps	148fps
Doom 64	174fps	282fps	322fps
F-Zero X	158fps	370fps	478fps
Hexen	156fps	300fps	360fps
Indiana Jones and the Infernal Machine	61fps	94fps	114fps
Killer Instinct Gold	~103fps	~168fps	~240fps
Legend of Zelda: Majora’s Mask	122fps	202fps	220fps
Mario Kart 64	~178fps	~309fps	~330-350fps
Perfect Dark (High-res)	70fps	125fps	130fps
Pilotwings 64	87fps	125fps	144fps
Quake	188fps	262fps	300fps
Resident Evil 2	183fps	226fps	383fps (*)
Star Wars Episode I: Battle for Naboo	90fps	136fps	178fps
Super Mario 64	129fps	204fps	220fps
Vigilante 8 (Low-res)	63fps	91fps	112fps
Vigilante 8 (High-res)	~46-55fps	~92-99fps	~119fps
World Driver Championship	~109fps	~225fps	~257fps

* – Has game breaking issues in this mode

System specs: CPU – Intel Core i7 7700k | GPU – AMD Radeon R9 290x (4GB VRAM, 2013) | 16GB RAM

Title	Angrylion	ParaLLEl RDP (Synchronous)	ParaLLEl RDP (Asynchronous)
007 GoldenEye	82fps	119fps	133fps
Banjo Tooie	72fps	132fps	148fps
Doom 64	174fps	282fps	322fps
F-Zero X	158fps	360fps	439fps
Hexen	156fps	288fps	352fps
Indiana Jones and the Infernal Machine	61fps	94fps	114fps
Killer Instinct Gold	~93fps	~162fps	~239fps
Legend of Zelda: Majora’s Mask	122fps	202fps	220fps
Mario Kart 64	~157fps	~274fps	~292fps
Perfect Dark (High-res)	70fps	125fps	130fps
Pilotwings 64	87fps	125fps	144fps
Quake	189fps	262fps	326fps
Resident Evil 2	156fps	226fps	383fps (*)
Star Wars Episode I: Battle for Naboo	90fps	136fps	178fps
Super Mario 64	129fps	195fps	209fps
Vigilante 8 (Low-res)	63fps	91fps	112fps
Vigilante 8 (High-res)	~46-55fps	~92-99fps	~119fps
World Driver Championship	~109fps	~224fps	~257fps

* – Has game breaking issues in this mode

Core option explanations

paraLLEl RDP has some special dedicated options. You can change these by going to Quick Menu and going to Options. Here’s a quick breakdown of what they do –

ParaLLEl Synchronous RDP:

Turning this off allows for higher CPU/GPU parallelism. However, there are certain games that might produce problems if left disabled. An example of such a game is Resident Evil 2.

It has been verified that with the vast majority of games, disabling this can provide for at least a +10fps speedup. Usually the performance difference is much higher though. Try experimenting with it. If you experience no game breaking bugs or visual anomalies, it’s safe to disable this for the game you’re running and enjoy higher performance.

Video Interface Options
ParaLLEl-RDP emulates the N64 RDP’s VI module. This applied plenty of postprocessing to the final output image to further smooth out the picture. Some of the options down below allow you to enable/disable some of these VI settings on the fly. Disabling some of these and enabling some others could be beneficial if you want to use several frontend shaders on top, since disabling some of these postprocessing effects could result in a radically different output image.

(ParaLLEl-RDP) VI Interlacing Disabling this will disable the VI serration bits used for interlaced video modes. Turning this off essentially looks like basic bob deinterlacing, the picture might become shaky as a result when leaving this off.

(ParaLLEl-RDP) VI Gamma Filter Disabling this will disable the hardware gamma filter that some games use.

(ParaLLEl-RDP) VI Divot filter Disabling this will disable the median filter which is intended to clean up some glitched pixels coming out of the RDP. Subtle difference in output, but usually seems to apply to shadow blob decals.

(ParaLLEl-RDP) VI AA Disabling this will disable Anti-Aliasing.

(ParaLLEl-RDP) VI Dither Filter The VI’s dither filter is used to make color banding less apparent with 16-bit pixels.

(ParaLLEl-RDP) VI Bilinear VI bilinear is the internal upscaler in the VI. Disabling this is typically a good idea, since it’s typically used to upscale horizontally.

By disabling VI AA and enabling VI Bilinear, the picture output looks just like how Angrylion’s “Unfiltered” mode currently looks like.

FAQ

Will this renderer be ported to OpenGL?

Here is the short answer – no. Not by us, at least. Reasons: OpenGL is an outdated API compared to Vulkan that does not support the features required by Parallel-RDP. GL does not support 8/16bit storage, external memory host, or async compute. If one would be able to make it work, it would only work on the very best GL implementation, where Vulkan is supported anyways, rendering it mostly moot.

Ports to DirectX 12 are similarly not going to be considered by us, others can feel free to do so. One word of warning – even DirectX12 (yes, even Ultimate) is found lacking when it comes to providing the graphics techniques that ParaLLEl RDP is built around. Whoever will take on the endeavor to port this to DX12 or GL 4.5/4.6 will have their work cut out for them.