paraLLEl N64 – Low-level RDP upscaling is finally here!

This year, ParaLLEl RDP has singlehandedly caused a breakthrough in N64 emulation. For the first time, the very CPU-intensive, accurate Angrylion renderer was lifted from the CPU to the GPU thanks to the powerful low-level graphics API Vulkan. Combined with a dynarec-powered RSP plugin, this has finally made low-level N64 emulation possible for the masses, at great speeds, on modest hardware configurations.

ParaLLEl RDP Upscaling

Jet Force Gemini running with 2x internal upscale

It quickly became apparent after launching ParaLLEl RDP that users have grown accustomed to seeing upscaled N64 graphics over the past 20 years. A renderer outputting at native resolution, however accurate and bit-exact, struck many of them as unpalatable. Over the past few weeks, many users indicated that upscaling was desired.

Well, now it's here. ParaLLEl RDP is the world's first low-level RDP renderer capable of upscaling. The output is unlike that of any HLE renderer of the past twenty years, because unlike them it includes full VI emulation (dithering, divot filtering, and basic edge anti-aliasing). You can upscale in integer steps of the base resolution: setting resolution upscaling to 2x multiplies the input resolution by 2, so 256×224 becomes 512×448; at 4x it becomes 1024×896, and at 8x, 2048×1792.

Now, here comes the good stuff with LLE RDP emulation. As said before, unlike most HLE renderers, ParaLLEl RDP fully emulates the RCP's Video Interface (VI). As part of this interface's postprocessing, it automatically applies an approximation of 8x MSAA (multi-sampled anti-aliasing) to the image. This means that even though our internal resolution might be 1024×896, the image is then further smoothed out by this aggressive AA postprocessing step.

Super Mario 64 running on ParaLLEl RDP with 2x internal upscale

This results in even games that run at just 2x native resolution looking significantly better than the same resolution running on an HLE RDP renderer. Look for instance at this Mario 64 screenshot here with the game running at 2x internal upscale (512×448).

How to install and set it up

RDP upscaling is available right now on Windows, Linux, and Android. We make no guarantees as to what kind of performance you can expect across these platforms; this is all contingent on your GPU's Vulkan drivers and its compute power.

Anyway, here is how you can get it.

  • In RetroArch, go to Online Updater.
  • (If you have paraLLEl N64 already installed) – Select ‘Update Installed Cores’. This will update all the cores that you already installed.
  • (If you don't have paraLLEl N64 installed already) – go to 'Core Updater' (older versions of RA) or 'Core Downloader' (newer versions of RA), and select 'Nintendo – Nintendo 64 (paraLLEl N64)'.
  • Now start up a game with this core.
  • Go to the Quick Menu and go to ‘Options’. Scroll down the list until you reach ‘GFX Plugin’. Set this to ‘parallel’. Set ‘RSP plugin’ to ‘parallel’ as well.
  • For the changes to take effect, we now need to restart the core. You can either close the game or quit RetroArch and start the game up again.

In order to upscale, you need to first set the Upscaling factor. By default, it is set to 1x (native resolution). Setting it to 2x/4x/8x then restarting the core makes the upscaling take effect.

Explanation of core options

A few new core option features have been added. We’ll briefly explain what they do and how you can go about using them.

  • (ParaLLEl-RDP) Upscaling factor (Restart)

Available options: 1x, 2x, 4x, 8x

The upscaling factor for the internal resolution. 1x is the default and is the native resolution. 2x, 4x, and 8x are all possible. NOTE: 8x requires at least 5GB/6GB of VRAM on your GPU. System requirements are steep for 8x and we generally don't recommend anything less than a 1080 Ti for it. Your mileage may vary; just be forewarned. 2x and 4x by comparison are much lighter. Even when upscaling, rendering still happens at full accuracy, and it is still all software rendered on the GPU. A 4x upscale means 16x the work that 1x Angrylion would churn through.

  • (paraLLEl-RDP) Downsampling

Available options: Disabled, 1/2, 1/4, 1/8

Also known as SSAA, this works much like the SSAA downsampling feature in Beetle PSX HW's Vulkan renderer. The idea is that you internally upscale to a higher resolution, then set this option from 'Disabled' to any of the other values. The internal higher-resolution image is then downscaled to half, one quarter, or one eighth of its size. This gives you a very smooth, anti-aliased picture that for all intents and purposes still outputs at 240p/240i. From there, you can apply frontend shaders on top to create a very nice and compelling look that is better than native resolution yet still very faithful to it.

So if you want 4x resolution upscaling with 4x SSAA, you'd set 'Downsampling' to '1/2'. With 4x upscale and 1/4 downsampling, you get 240p output with 16x SSAA, which looks great with CRT shaders.
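
To make the arithmetic concrete, here is a minimal sketch (a hypothetical helper purely for illustration, not part of the core) that derives the output geometry and effective SSAA factor from the two options:

```cpp
#include <cstdio>

// Hypothetical helper: "upscale" is the upscaling factor (1, 2, 4, 8) and
// "downsample_shift" encodes the Downsampling option
// (Disabled = 0, 1/2 = 1, 1/4 = 2, 1/8 = 3).
struct OutputGeometry { unsigned width, height, ssaa; };

static OutputGeometry effective_output(unsigned native_w, unsigned native_h,
                                       unsigned upscale, unsigned downsample_shift)
{
    unsigned out_scale = upscale >> downsample_shift; // e.g. 4x upscale, 1/4 downsample -> native
    if (out_scale == 0) out_scale = 1;                // cannot downsample below native
    unsigned per_axis = upscale / out_scale;          // samples per output pixel, per axis
    return { native_w * out_scale, native_h * out_scale, per_axis * per_axis };
}

int main()
{
    OutputGeometry g = effective_output(256, 224, 4, 2); // 4x upscale, 1/4 downsample
    std::printf("%ux%u with %ux SSAA\n", g.width, g.height, g.ssaa); // 256x224 with 16x SSAA
}
```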

  • (paraLLEl-RDP) Use native texture LOD when upscaling

This option is disabled by default.

So far we have found only one game that absolutely requires this to be turned on for gameplay purposes: without it, the Princess-to-Bowser painting transition in Mario 64 does not happen, and you instead just see Bowser in the portrait from a distance. There might be further improvements later to detect these cases automatically.

Most N64 games didn't use mipmapping, but the ones that do generally benefit from this setting being off – you keep higher-quality texture LODs instead of a lower-quality LOD only making way for a more detailed one as you get closer. However, turning this option on may be desirable if you prefer LOD behavior faithful to how the game originally looked rather than one that takes advantage of the higher resolution.

  • (paraLLEl-RDP) Use native resolution for TEX_RECT

This option is on by default.

2D elements such as sprites are usually rendered with TEX_RECT commands, and trying to upscale them inevitably leads to ugly “seams” in the picture. This option forces native resolution rendering for such sprites.

 

Managing expectations

It's important that people understand what the focus of this renderer is. There is no intent to make yet another enhancement-focused renderer here. This is the closest thing there has ever been to a full software-rendered reimplementation of Angrylion on the GPU, with additional niceties like upscaling. The renderer guarantees bit-exactness: what you see is what you would get on a real N64, no exceptions.

With an HLE renderer, the scene is rendered using either OpenGL or Vulkan rasterization rules. Here, neither is done – the exact rasterization steps of the RDP are followed instead; there are no API calls to GL to draw triangles here or there. So how is this done? Through compute shaders. It's been established that you cannot correctly emulate the RDP's rasterization rules by simply mapping them to OpenGL. This is why previous attempts like z64gl fell flat after an initially promising start.

So the value proposition here for upscaling with ParaLLEl RDP is quite compelling – you get upscaling with the most accurate renderer this side of Angrylion. It runs well thanks to Vulkan, and you can upscale all the way to 8x (which is an insane workload for a GPU done this way). And purists get the added satisfaction of seeing, for the first time, upscaled N64 graphics with the N64's entire postprocessing pipeline in action, courtesy of the Video Interface. You get dither filtering that smooths out really well at higher resolutions and can really fake the illusion of higher bit depth. HLE renderers have a lot of trouble with the kind of depth cueing and dithering applied to the geometry, but ParaLLEl RDP does this effortlessly. The upscaled graphics therefore look less sterile: with traditional GL/Vulkan rasterization you'd just see the same repeated textures with the same flat opacity everywhere, whereas here dithering and divot filtering add noise to the image, leading to an overall richer picture.

So basically, the aim here is actually emulating the RDP and RSP. The focus is not on getting the majority of commercial games to just run and simulating the output they would generate through higher level API calls.

Won’t be done – where HLE wins

Therefore, the following requests will not be pursued, at least not in the near future:

* Widescreen rendering – Can be done through game patches (ASM patches applied directly to the ROM, or bps/ups patches or something similar), but it has to be done on a per-game basis. With HLE there are ways to modify the view frustum and viewport dimensions to do this, but it almost never works right because games occlude geometry and objects based on your view distance, so game patches implementing widescreen and DOF/draw distance enhancements are always preferable.

So, in short, yes, you can do this with ParaLLEl RDP too, just with per-game specific patches. Don’t expect a core option that you can just toggle on or off.
* Rendering framebuffer effects at higher resolution – not really possible with LLE, and we don't see much payoff to it either. Super-sampled framebuffer effects might be possible in theory.
* Texture resolution packs – Again, no. The nature of an LLE renderer is right there in the name: low-level. While the RDP is processing streams of data (fed to it by the RSP), there is barely any notion of a 'texture' – it only sees TMEM uploads and tile descriptors which point to raw bytes. With high-level emulation, you have a higher abstraction level where you can 'hook' into the parts where you think a texture upload might be going on and replace it on the fly. Those looking for something like that are really looking in the wrong place with ParaLLEl RDP. ParaLLEl RDP is about making authentic N64 rendering look as good as possible without resorting to replacing original assets or anything bootleg like that.
* Z-fighting/subpixel precision: In some games there is slight Z-fighting in the distance which HLE renderers typically don't show. Again, this is because this is accurate RDP emulation; Z-fighting is a thing. The RDP only has 18-bit UNORM depth precision with 10 bits of fractional precision during interpolation, and compression on top of that to squeeze it down to 14 bits. An HLE emulator can render with 24+ bits of depth. Compounding this, because the RSP is low-level, it sends 16-bit fixed-point vertex coordinates to the RDP for rendering. A typical HLE renderer and HLE RSP would just determine that we are about to draw some 3D geometry and turn it into float values, giving a higher level of precision for vertex positioning. If you recall, the PlayStation 1's GTE also did not deal with vertex coordinates in floats but in fixed point; there, we had to go to the effort of doing PGXP in order to convert them to float. I really doubt there is any interest in contemplating this at this point. Best to let sleeping dogs lie.
* Raw performance. HLE uses the hardware rasterization and texture units of the GPU, which are far more efficient than software rendering, but of course far less accurate.

Where LLE wins

Conversely, there are parts where LLE wins over HLE, and where HLE can’t really go –

* HLE tends to struggle with decals: the host API's depth bias doesn't really emulate the RDP's depth bias scheme at all, since the RDP's is a double-sided test. Depth bias is also notorious for behaving differently on different GPUs.
* Correct dithering. An HLE renderer still has to work with a fixed-function blending pipeline. A software-rendered rasterizer like ParaLLEl RDP does not have to work within any pre-existing graphics API setup; it implements its own rasterizer and outputs to the screen through compute shading. Correct dither means applying dither after blending, among other things, which is not something you can generally do with HLE – you need to blend at 8-bit precision and apply dither plus quantization at the end, an ordering that fixed-function blending cannot express (see the sketch after this list). Faking dithering in OpenGL generally looks somewhat tacky.
* The entire VI postprocessing pipeline. Again, it bears repeating that not only is the RDP's output being upscaled, so is the VI filtering. VI filtering got a bad rep on the N64 because this overaggressive quasi-8x MSAA tended to make the already low-resolution image look even blurrier, but at higher resolutions, as you can see here, it can really shine. You need programmable blending to emulate the VI's coverage handling, and this just is not practical with OpenGL and/or the current HLE renderers out there. The VI also has a quite ingenious filter that redistributes dither noise to reconstruct more color depth. So not only are we getting post-process anti-aliasing, we're also getting more color depth.
* What You See Is What You Get. This renderer is the hypothetical what-if scenario of what an 'N64 Pro' might have looked like: the very same hardware, but able to pump out insane resolutions. Notice that the RDP/VI implementations in ParaLLEl RDP have NOT been enhanced in any way. The only real change was modifying the rasterizer to test fractional pixel coordinates as well.
* Full accuracy with CPU readbacks. The CPU can freely read and write RDP-rendered data, and we can deal with it without extra hacks.
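
To illustrate the ordering issue from the dithering point above, here is a minimal per-channel sketch. This is not the RDP's actual blender equation (which is considerably more involved, with multiple modes and coverage inputs); it only shows why dither plus quantization has to happen after 8-bit blending, which a fixed-function blend unit cannot express:

```cpp
#include <cstdint>

// Simplified straight-alpha blend at 8-bit precision. The real RDP blender has
// several modes and a two-cycle pipeline; this is only the ordering that matters here.
static inline uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return static_cast<uint8_t>((src * alpha + dst * (255 - alpha)) / 255);
}

// Add the dither offset *after* blending, then truncate to the 5-bit channel of a
// 16-bit framebuffer. Deferring the quantization is what recovers apparent color depth.
static inline uint8_t dither_and_quantize_to_5bit(uint8_t channel, uint8_t dither /* 0..7 */)
{
    unsigned value = channel + dither;
    if (value > 255) value = 255;
    return static_cast<uint8_t>(value >> 3);
}
```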

Known issues

  • The deinterlacing for interlaced video modes is still rather poor (just like Angrylion) – a basic bob-and-weave setup. There are plans to come up with a completely new system.
  • Mario Tennis glitches out a bit with upscaling for some reason; there might be subtle bugs in the implementation that only manifest in that game. This does not seem to happen on Nvidia's Windows drivers, though.

Screenshots

The screenshots below show ParaLLEl RDP running at its maximum internal resolution, 8x the original native image. This means that when your game is running at, say, 256×224, it would be rendering at 2048×1792. But if your game runs at, say, 640×480 (some interlaced games actually set the resolution that high, Indiana Jones IIRC), then we'd be looking at 5120×3840. That's bigger than 4K! Bear in mind that you're also getting the VI's 8x MSAA on top of that, and you can probably begin to imagine just how demanding this is on your GPU, given that it's running a custom software rasterizer in compute. Suffice it to say, the demands for 2x and 4x will probably not be too steep, but if you're thinking of using 8x, you'd better bring some serious GPU horsepower. You'll need at least 5-6GB of VRAM for 8x internal resolution, for starters.

Anyway, without further ado, here are some glorious screenshots. GoldenEye 007 now looks dangerously close to the upscaled bullshot images on the back of the boxart!

GoldenEye 007 running with ParaLLEl RDP at 8x internal upscale
Super Mario 64 running on ParaLLEl RDP with 8x internal upscale
Star Fox 64 running on ParaLLEl RDP with 8x internal upscale
Perfect Dark running on ParaLLEl RDP with 8x internal upscale in high-res mode
World Driver Championship running on ParaLLEl RDP with 8x internal upscale

Videos

Body Harvest

Perfect Dark

Legend of Zelda: Ocarina of Time

Super Mario 64

Coming to Mupen64Plus Next soon

ParaLLEl RDP will also be making its way into the upcoming new version of Mupen64Plus Next. Expect increased compatibility over ParaLLEl N64 (especially on Android) and potentially better performance in many games.

Future blog posts

There might eventually be a future blog post by Themaister going into more technical detail on the inner workings of ParaLLEl RDP. I will also probably release a performance-focused blog post later, testing a variety of different GPUs and seeing how far we can take them as far as upscaling is concerned.

I can already tell you to temper your expectations with regards to Android/mobile GPUs. I tested ParaLLEl RDP with 2x upscaling on a Samsung Galaxy S10+ and performance was about 36fps with vsync off. At 1x native resolution I manage to get 64 to 70fps on average with the same games. So obviously mobile GPUs still have a lot of catching up to do with their discrete big brothers on the desktop.

At least it will make for a nice GPU benchmark for mobile hardware until we eventually crack fullspeed with 2x native!

Reviving and rewriting paraLLEl-RDP – Fast and accurate low-level N64 RDP emulation

Over the last few months, after completing the paraLLEl-RSP rewrite to a Lightrec based recompiler, I've been plugging away on a project which I had been putting off for years: implementing the N64 RDP with Vulkan compute shaders in a low-level fashion. Every part of the old implementation's design has been scrapped, and a new implementation has arisen from the ashes. I've learned a lot of advanced compute techniques, and I'm able to use far better methods than I was ever able to use back in the early days. This time, I wanted to do it right. Writing a good, accurate software renderer on a massively parallel architecture is not easy and you need to rethink everything. Serial C code will get you nowhere on a GPU, but it's a fun puzzle, and quite rewarding when stuff works.

The new implementation is a standalone repository that could be integrated into any emulator given the effort: https://github.com/Themaister/parallel-rdp. For this first release, I integrated it into parallel-n64. It is licensed as MIT, so feel free to integrate it in other emulators as well.

Why?

I wanted to prove to myself that I could, and it’s … a little fun? I won’t claim this is more than it is. 🙂

Chasing bit-exactness

The new implementation was developed in a test-driven way. The Angrylion renderer is used as a reference, and the goal is to generate the exact same output in the new renderer. I started writing an RDP conformance suite: we generate RDP commands in C++, run the commands across different implementations, and compare results in RDRAM (and hidden RDRAM of course – 9-bit RAM is no joke). To pass, we must get an exact match. This is all fixed-point arithmetic, no room for error! I've basically just been studying Angrylion to understand what on earth is supposed to happen, and trying to make sense of what the higher-level goal of everything is. In LLE, there's a lot of weird magic that just happens to work out.
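
In essence, each conformance case boils down to something like the sketch below. The interfaces here are hypothetical stand-ins rather than the suite's actual API: feed the same command stream to both implementations, then require a byte-exact match in RDRAM and hidden RDRAM.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical interface a renderer under test might expose.
struct RDPImplementation
{
    virtual ~RDPImplementation() = default;
    virtual void process_commands(const std::vector<uint64_t> &commands) = 0;
    virtual const uint8_t *rdram() const = 0;        // main RDRAM contents
    virtual const uint8_t *hidden_rdram() const = 0; // the extra "9th bit" storage
    virtual std::size_t rdram_size() const = 0;
};

// Run the same command stream through the reference (Angrylion) and the new GPU
// renderer, then demand a bit-exact match. Fixed-point math leaves no room for error.
// (Assumes one hidden byte per 16-bit RDRAM word, hence size / 2.)
static bool run_conformance_case(RDPImplementation &reference, RDPImplementation &gpu,
                                 const std::vector<uint64_t> &commands)
{
    reference.process_commands(commands);
    gpu.process_commands(commands);

    return std::memcmp(reference.rdram(), gpu.rdram(), reference.rdram_size()) == 0 &&
           std::memcmp(reference.hidden_rdram(), gpu.hidden_rdram(),
                       reference.rdram_size() / 2) == 0;
}
```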

I’m quite happy with where I’ve ended up with testing and seeing output like this gives me a small dopamine shot before committing:

122/163 Test #122: rdp-test-interpolation-color-texture-ci4-tlut-ia16 ………………………….. Passed 2.50 sec
Start 123: rdp-test-interpolation-color-texture-ci8-tlut-ia16
123/163 Test #123: rdp-test-interpolation-color-texture-ci8-tlut-ia16 ………………………….. Passed 2.40 sec
Start 124: rdp-test-interpolation-color-texture-ci16-tlut-ia16
124/163 Test #124: rdp-test-interpolation-color-texture-ci16-tlut-ia16 …………………………. Passed 2.37 sec
Start 125: rdp-test-interpolation-color-texture-ci32-tlut-ia16
125/163 Test #125: rdp-test-interpolation-color-texture-ci32-tlut-ia16 …………………………. Passed 2.45 sec
Start 126: rdp-test-interpolation-color-texture-2cycle-lod-frac
126/163 Test #126: rdp-test-interpolation-color-texture-2cycle-lod-frac ………………………… Passed 2.51 sec
Start 127: rdp-test-interpolation-color-texture-perspective
127/163 Test #127: rdp-test-interpolation-color-texture-perspective ……………………………. Passed 2.50 sec
Start 128: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac
128/163 Test #128: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac ……………… Passed 3.29 sec
Start 129: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen
129/163 Test #129: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen ………. Passed 3.26 sec
Start 130: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-detail
130/163 Test #130: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-detail ……….. Passed 3.48 sec
Start 131: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen-detail
131/163 Test #131: rdp-test-interpolation-color-texture-perspective-2cycle-lod-frac-sharpen-detail … Passed 3.26 sec
Start 132: rdp-test-texture-load-tile-16-yuv

151/163 Test #151: vi-test-aa-none …………………………………………………………. Passed 21.19 sec
Start 152: vi-test-aa-extra-dither-filter
152/163 Test #152: vi-test-aa-extra-dither-filter ……………………………………………. Passed 48.77 sec
Start 153: vi-test-aa-extra-divot
153/163 Test #153: vi-test-aa-extra-divot …………………………………………………… Passed 64.29 sec
Start 154: vi-test-aa-extra-dither-filter-divot
154/163 Test #154: vi-test-aa-extra-dither-filter-divot ………………………………………. Passed 65.90 sec
Start 155: vi-test-aa-extra-gamma
155/163 Test #155: vi-test-aa-extra-gamma …………………………………………………… Passed 48.28 sec
Start 156: vi-test-aa-extra-gamma-dither
156/163 Test #156: vi-test-aa-extra-gamma-dither …………………………………………….. Passed 48.18 sec
Start 157: vi-test-aa-extra-nogamma-dither
157/163 Test #157: vi-test-aa-extra-nogamma-dither …………………………………………… Passed 47.56 sec

100% tests passed, 0 tests failed out of 163 #feelsgoodman

Ideally, if someone is clever enough to hook up a serial connection to the N64, it might be possible to run these tests on a real N64 as well; that would be interesting.

I also fully implemented the VI this time around. It produces bit-exact output against Angrylion in my tests, and there is a VI conformance suite to validate this as well. I implemented almost the entire thing without even running actual content. Once I got to test real content and sort out the last weird bugs, we get to the next important part of a test-driven development workflow …

The importance of dumping formats

A critical aspect of verifying behavior is being able to dump RDP commands from the emulator and replay them.

On the left I have Angrylion and on the right paraLLEl-RDP, running side by side from a dump where I can step draw by draw and drill down into any pesky bugs quite effectively. This humble tool has been invaluable. The Angrylion backend in parallel-n64 can be configured to generate dumps, which are then used to track down rendering bugs offline.

Compatibility

Compatibility is much improved and should be quite high. I won't claim it's perfect, but I'm quite happy with it so far. We went through essentially all relevant titles during testing (just the first few minutes) and fixed the few issues which popped up. Many games which were completely broken in the old implementation now work just fine. I'm fairly confident that any remaining bugs are solvable this time around if/when they show up.

Implementation techniques

With Vulkan in 2020 I have more tools in my belt than were available back in the day. Vulkan is a quite capable compute API now.

Enforcing RDRAM coherency

A major pain point of any N64 emulator is the fact that RDRAM is shared for the CPU and RDP, and games sure know how to take advantage of this. This creates a huge burden on GPU-accelerated implementations as we now have to ensure full coherency to make it accurate. Most HLE emulators simply don’t care or employ complicated heuristics and workarounds, and that’s fine, but it’s not good enough for LLE.

The previous implementation tried to use "framebuffer manager" techniques similar to HLE emulators, but this was the wrong approach and led to a design which was impossible to fix. What if … we just import RDRAM as a buffer straight into the Vulkan driver and render to that, wouldn't that be awesome? Yes … yes, it would be, and that's what I did. We have an obscure but amazing extension in Vulkan called VK_EXT_external_memory_host which lets me import RDRAM from the emulator straight into Vulkan and render to it over the PCI-e bus. That way, all framebuffer management woes simply disappear; I render straight into RDRAM, and the only thing left to do is to handle synchronization. If you're worried about rendering over the PCI-e bus, then don't be. The bandwidth required to write out a 320×240 framebuffer is absolutely trivial, especially considering that we're doing …
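
(As a quick aside before moving on: here is roughly what importing RDRAM via VK_EXT_external_memory_host looks like in Vulkan. This is a hedged sketch, not the actual paraLLEl-RDP code – error handling and memory-type selection are omitted, and the pointer and size are assumed to satisfy the driver's minImportedHostPointerAlignment requirement.)

```cpp
#include <vulkan/vulkan.h>

// Import the emulator's RDRAM allocation directly as Vulkan device memory and wrap
// it in a storage buffer the compute shaders can read and write.
VkBuffer import_rdram_as_buffer(VkDevice device, void *rdram_ptr, VkDeviceSize rdram_size,
                                uint32_t host_memory_type_index, VkDeviceMemory *out_memory)
{
    VkExternalMemoryBufferCreateInfo external_info = { VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_BUFFER_CREATE_INFO };
    external_info.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;

    VkBufferCreateInfo buffer_info = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    buffer_info.pNext = &external_info;
    buffer_info.size = rdram_size;
    buffer_info.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
    buffer_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer buffer = VK_NULL_HANDLE;
    vkCreateBuffer(device, &buffer_info, nullptr, &buffer);

    VkImportMemoryHostPointerInfoEXT import_info = { VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT };
    import_info.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
    import_info.pHostPointer = rdram_ptr;

    VkMemoryAllocateInfo alloc_info = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
    alloc_info.pNext = &import_info;
    alloc_info.allocationSize = rdram_size; // must be a multiple of minImportedHostPointerAlignment
    alloc_info.memoryTypeIndex = host_memory_type_index; // pick from vkGetMemoryHostPointerPropertiesEXT

    vkAllocateMemory(device, &alloc_info, nullptr, out_memory);
    vkBindBufferMemory(device, buffer, *out_memory, 0);
    return buffer;
}
```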

Tile-based rendering

The last implementation was tile-based as well, but the design is much improved. This time around all tile binning is done entirely on the GPU in parallel, using techniques I implemented in https://github.com/Themaister/RetroWarp, which was the precursor project for this new paraLLEl-RDP. Using tile-based rendering, it does not really matter that we’re effectively rendering over the PCI-e bus as tile-based rendering is extremely good at minimizing external memory bandwidth. Of course, for iGPU, there is no (?) external PCI-e bus to fight with to begin with, so that’s nice!
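
A scalar reference of the binning idea is sketched below. The real work happens in a compute pass on the GPU, and the details in RetroWarp/paraLLEl-RDP differ; this is only meant to show the shape of it: each primitive's screen-space bounding box is snapped to the tile grid and its index appended to a per-tile work list, which the shading pass then consumes one tile per workgroup.

```cpp
#include <vector>

// Hypothetical tile dimensions; the real renderer picks sizes tuned for the GPU.
constexpr int TILE_WIDTH = 8;
constexpr int TILE_HEIGHT = 8;

// Append this primitive to every tile its (non-negative) bounding box touches.
void bin_primitive(unsigned primitive_index,
                   int min_x, int min_y, int max_x, int max_y,
                   unsigned tiles_per_row,
                   std::vector<std::vector<unsigned>> &tile_lists)
{
    for (int ty = min_y / TILE_HEIGHT; ty <= max_y / TILE_HEIGHT; ty++)
        for (int tx = min_x / TILE_WIDTH; tx <= max_x / TILE_WIDTH; tx++)
            tile_lists[ty * tiles_per_row + tx].push_back(primitive_index);
}
```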

Ubershaders with asynchronous pipeline optimization

The entire renderer is split into a very small selection of Vulkan GLSL shaders which are precompiled into SPIR-V. This time, I take full advantage of Vulkan specialization constants, which allow me to fine-tune the shader for specific RDP state. This turned out to be an absolutely massive win for performance. To avoid the dreaded shader compilation stutter, I can always fall back to a generic ubershader while the specialized pipeline is being compiled; the ubershader is slow, but works for any combination of state. This is very similar to the idea Dolphin pioneered for emulation a few years ago.
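
Roughly, baking static RDP state into a compute pipeline via specialization constants looks like this. The state fields and constant IDs here are illustrative; the actual layout in paraLLEl-RDP is different.

```cpp
#include <vulkan/vulkan.h>
#include <cstddef>
#include <cstdint>

// Illustrative static state we want folded into the shader at pipeline-compile time.
struct StaticRasterizerState
{
    uint32_t cycle_type;
    uint32_t enable_texturing;
};

VkPipeline build_specialized_pipeline(VkDevice device, VkPipelineLayout layout,
                                      VkShaderModule rasterizer_spirv,
                                      const StaticRasterizerState &state)
{
    const VkSpecializationMapEntry entries[] = {
        { 0, offsetof(StaticRasterizerState, cycle_type),       sizeof(uint32_t) },
        { 1, offsetof(StaticRasterizerState, enable_texturing), sizeof(uint32_t) },
    };

    VkSpecializationInfo spec = {};
    spec.mapEntryCount = 2;
    spec.pMapEntries = entries;
    spec.dataSize = sizeof(state);
    spec.pData = &state;

    VkComputePipelineCreateInfo info = { VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO };
    info.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    info.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
    info.stage.module = rasterizer_spirv;
    info.stage.pName = "main";
    info.stage.pSpecializationInfo = &spec;
    info.layout = layout;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline);
    return pipeline; // until this is ready, the generic ubershader pipeline is bound instead
}
```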

8/16-bit integer support

Memory accesses in the RDP are often 8 or 16 bits, and thus it is absolutely critical that we make use of 8/16-bit storage features to interact directly with RDRAM, and if the GPU supports it, we can make use of 8 and 16-bit arithmetic as well for good measure.
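
Checking for those capabilities is a matter of chaining the standard feature structs through vkGetPhysicalDeviceFeatures2, roughly like this (the struct and member names are the standard Vulkan ones; the decision logic is just an illustration):

```cpp
#include <vulkan/vulkan.h>

// Query 8/16-bit storage and arithmetic support before deciding which shader
// variants to use.
void query_small_integer_support(VkPhysicalDevice gpu)
{
    VkPhysicalDeviceShaderFloat16Int8FeaturesKHR int8_features =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT16_INT8_FEATURES_KHR };
    VkPhysicalDevice8BitStorageFeaturesKHR storage8 =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_8BIT_STORAGE_FEATURES_KHR };
    storage8.pNext = &int8_features;
    VkPhysicalDevice16BitStorageFeatures storage16 =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES };
    storage16.pNext = &storage8;

    VkPhysicalDeviceFeatures2 features2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2 };
    features2.pNext = &storage16;
    vkGetPhysicalDeviceFeatures2(gpu, &features2);

    // Direct 8/16-bit access to the RDRAM storage buffer.
    bool can_access_rdram_directly = storage16.storageBuffer16BitAccess &&
                                     storage8.storageBuffer8BitAccess;
    // Optional: native 8/16-bit arithmetic in the shaders.
    bool can_use_small_arithmetic = features2.features.shaderInt16 && int8_features.shaderInt8;
    (void)can_access_rdram_directly;
    (void)can_use_small_arithmetic;
}
```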

Async compute

Async compute is critical as well, since we can make the async compute queue high priority and ensure that RDP shading work happens with very low latency, while VI filtering and frontend shaders can happily chug along in the fragment/graphics queue. Both AMD and NVIDIA now have competent implementations here.
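
In practice this means requesting a dedicated compute queue at device creation time and giving it the highest priority, along the lines of the sketch below (queue-family discovery, extension setup, and feature chaining are omitted; this is not the actual paraLLEl-RDP device setup):

```cpp
#include <vulkan/vulkan.h>

// Create a device with a graphics queue for VI filtering / frontend shaders and a
// separate, higher-priority compute queue for RDP rendering work.
VkDevice create_device_with_async_compute(VkPhysicalDevice gpu,
                                          uint32_t graphics_family, uint32_t compute_family,
                                          VkQueue *out_graphics_queue, VkQueue *out_compute_queue)
{
    const float graphics_priority = 0.5f;
    const float compute_priority  = 1.0f; // hint that RDP work should be scheduled first

    VkDeviceQueueCreateInfo queues[2] = {};
    queues[0].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queues[0].queueFamilyIndex = graphics_family;
    queues[0].queueCount = 1;
    queues[0].pQueuePriorities = &graphics_priority;
    queues[1].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queues[1].queueFamilyIndex = compute_family; // ideally a compute-only family
    queues[1].queueCount = 1;
    queues[1].pQueuePriorities = &compute_priority;

    VkDeviceCreateInfo device_info = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
    device_info.queueCreateInfoCount = 2;
    device_info.pQueueCreateInfos = queues;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(gpu, &device_info, nullptr, &device);
    vkGetDeviceQueue(device, graphics_family, 0, out_graphics_queue);
    vkGetDeviceQueue(device, compute_family, 0, out_compute_queue);
    return device;
}
```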

GPU-driven TMEM management

A big mistake I made previously was doing TMEM management on the CPU timeline; this all came crashing down once we needed framebuffer effects. To avoid this, all TMEM uploads are now driven by the GPU. This is probably the hairiest part of paraLLEl-RDP by far, but I have quite a lot of gnarly tests covering the relevant corner cases. There are some truly insane edge cases that I cannot handle yet, but the results they produce would be completely meaningless to any actual content.

Performance

To talk about FPS figures, it's important to consider the three major performance hogs in a low-level N64 emulator: the VR4300 CPU, the RSP, and finally the RDP. Emulating the RSP in an LLE fashion is still somewhat taxing, even with a dynarec (paraLLEl-RSP). Even if I make the RDP infinitely fast, there is an upper bound to how fast we can make the emulator run, as the CPU and RSP are still completely single-threaded affairs. Do keep that in mind. Still, even with multithreaded Angrylion, the RDP represents a quite healthy chunk of overhead that we can almost entirely remove with a GPU implementation.

GPU bound performance

It's useful to look at what performance we'd be getting if emulation were no constraint at all. By adding PARALLEL_RDP_BENCH=1 to the environment variables, I can look at how much time is spent on GPU rendering.

Playing on a GTX 1660 Ti outside the castle in Mario 64:

[INFO]: Timestamp tag report: render-pass
[INFO]: 0.196 ms / frame context
[INFO]: 0.500 iterations / frame context

We're talking ~0.2 ms on the GPU to render one frame on average – hello, theoretical 5000 VI/s … Somewhat smaller frame times can be observed on my Radeon 5700 XT, but we're getting frame rates so ridiculously high they become meaningless here. We've tested it on quite old cards as well, and the difference in FPS between even something ancient like an R9 290x and a 2080 Ti is minimal, since the time now spent in RDP rendering is completely irrelevant compared to the CPU + RSP workloads. We seem to be getting about a 50-100% uplift in FPS, which represents the shaved-away overhead of the old CPU renderer. Hello 300+ VI/s!

Unfortunately, Intel iGPUs do not fare as well, with overhead high enough that they do not generally beat multithreaded Angrylion running on the CPU. I was somewhat disappointed by this, but I have not done any real shader optimization work yet. My early analysis suggests extremely poor occupancy and a ton of register spilling. I want to create a benchmark tool at some point to help drill into these issues down the line.

It would be interesting to test on the AMD APUs, but none of us have the hardware handy sadly 🙁

Synchronous vs Asynchronous RDP

There are two modes for the RDP. In async mode, the emulation thread does not wait for the GPU to complete rendering. This improves performance, at the cost of accuracy. Many games unfortunately really rely on the unified memory architecture of the N64. The default option is sync, and should be used unless you have a real need for speed, or the game in question does not need sync.

Here we see an example of broken blob shadows caused by async RDP in Jet Force Gemini. This happens because the CPU actually reads the shadowmap rendered by the RDP, blurs it on the CPU timeline (why on earth the game would do that is another question), then reuploads it to the RDP. These kinds of effects require very tight synchronization between CPU and GPU and come up in many games. The N64 is particularly notorious for these kinds of rendering challenges.

Of course, given how fast the GPU implementation is on discrete GPUs, sync mode does not really pose an issue. Do note that since we're using async compute queues here, we are not stalling on frontend shading or anything like that. The typical stall time on the CPU is on the order of 1 ms per frame, which is very acceptable. That includes the render thread doing its thing, submitting work to the GPU, getting it executed and coming back to the CPU, which has some extra overhead.

Road-map for future improvement

I believe this is solid enough for a first release, but there are further avenues for improvement.

Figure out poor performance on Intel iGPU

There is something going on here that we should be able to improve.

Implement a workaround for implementations without VK_EXT_external_memory_host (EDIT: Now implemented as of 2020-05-18)

Unfortunately there is one particular driver on desktop which doesn’t support this, and that’s NVIDIA on Linux (Windows has been supported since 2018 …). Hopefully this gets implemented soon, but we will need a fallback. This will get ugly since we’ll need to start shuffling memory back and forth between RDRAM and a GPU buffer. Hopefully the async transfer queue can help make this less painful. It might also open up some opportunities for mobile, which also don’t implement this extension as we speak. There might also be incentives to rewrite some fundamental assumptions in the N64 emulator plugin specifications (can we please get rid of this crap …). If we can let the GPU backend allocate memory, we don’t need any fancy extension, but that means uprooting 20 years of assumptions and poking into the abyss … Perhaps a new implementation can break new ground here (hi @ares_emu!).

EDIT: This is now done! Takes a 5-10% performance hit in sync mode, but the workaround works quite well. A fine blend of masked SIMD moves, a writemask buffer, and atomics …
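
For the curious, here is a scalar sketch of the idea behind that workaround. It is heavily simplified and not the actual implementation (the real path shuffles data asynchronously and uses SIMD and atomics): the GPU renders into a shadow copy of RDRAM plus a writemask buffer marking which words it actually touched, and only those words are folded back into the emulator's RDRAM on the CPU timeline.

```cpp
#include <cstddef>
#include <cstdint>

// Merge the GPU's shadow copy back into emulator RDRAM, touching only the words the
// RDP actually wrote, so concurrent CPU writes to untouched words are preserved.
void merge_shadow_into_rdram(uint32_t *rdram, const uint32_t *gpu_shadow,
                             const uint32_t *writemask_bits, std::size_t word_count)
{
    for (std::size_t i = 0; i < word_count; i++)
    {
        uint32_t mask = writemask_bits[i >> 5]; // one bit per 32-bit word
        if (mask & (1u << (i & 31)))
            rdram[i] = gpu_shadow[i];
    }
}
```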

Internal upscaling?

It is rather counter-intuitive to do upscaling in an LLE emulator, but it might yield some very interesting results. Given how obscenely fast the discrete GPUs are at this task, we should be able to do a 2x or maybe even 4x upscale at way-faster-than-realtime speeds. It would be interesting to explore if this lets us avoid the worst artifacts commonly associated with upscaling in HLE.

Fancier deinterlacer?

Some N64 content runs at 480i, and we can probably spare some GPU cycles running a fancier deinterlacer 😉

Esoteric use cases?

PS1 wobbly-polygon rendering has seen something of a resurgence in the indie scene in recent years; perhaps we'll see the same for the fuzzy N64 look eventually. With paraLLEl-RDP, it should be possible to build a rendering engine around an N64-lookalike game. That would be cool to see.

Conclusion

This is a somewhat esoteric implementation, but I hope I’ve inspired more implementations like this. Compute-based accurate renderers will hopefully spread to more systems that have difficulties with accurate rendering. I think it’s a very interesting topic, and it’s a fun take on emulation that is not well explored in general.

Parallel N64 with Multithreaded Angrylion released!

We originally intended to release this together with the new RetroArch version right before the end of this month. However, we want to take a few more days to ensure that the release of RetroArch 1.6.8 is solid and that we don’t rush it out of the gates in a premature state. We ask for your patience, it won’t take too long, a couple of days at most. In the meantime, we have the Parallel N64 core with multithreaded Angrylion ready to go!

This is a heavily modified version of ata4‘s Angrylion RDP Plus plugin. It has the following distinctive characteristics so far:

1 – Made a bunch of changes so that performance in Linux/Mingw is not as bad as it was previously (still worse than Windows though).
2 – Does not require an OpenGL 3.2 context, or OpenGL at all. It is purely a software renderer that can use any output video driver you want in your libretro frontend. So you can use this in conjunction with OpenGL, Direct3D, Vulkan, etc.

Credit goes to mudlord, Brad Parker and AIO for getting this done on such short notice. I helped out along the way too.

Available for

  • Linux
  • Windows
  • Android

Where to get it

1. Start RetroArch.
2. Go to Online Updater -> Update Cores.
3. Download ‘Nintendo 64 (Parallel N64)’ from the list.

How to use it

1. Start up the Parallel N64 core with any game.

2. Go to Quick Menu -> Options. Make sure that you set 'GFX Plugin' to 'angrylion' and 'RSP Plugin' to 'cxd4'. Restart RetroArch.

3. It should now use multithreaded Angrylion as the graphics plugin.

Performance

This scene serves as our benchmark test. Fullspeed framerate has been enabled.

For the purpose of this performance test, I am running the game Super Mario 64.

The system on which the tests are being performed is a Core i7 7700k processor with 16GB of RAM running Windows 10 and Linux respectively.

Windows

CPU Core           | Angrylion version           | OS         | Performance (VI Overlay on) | Performance (VI Overlay off)
Cached interpreter | Old Angrylion               | Windows 10 | 52fps                       | 63fps
Dynarec            | Old Angrylion               | Windows 10 | 52fps                       | 64fps
Dynarec            | New Angrylion Multithreaded | Windows 10 | 114fps                      | 123fps
Cached interpreter | New Angrylion Multithreaded | Windows 10 | 106fps                      | 118fps

Linux

CPU Core           | Angrylion version           | OS    | Performance (VI Overlay on) | Performance (VI Overlay off)
Cached interpreter | Old Angrylion               | Linux | 53fps                       | 63fps
Dynarec            | Old Angrylion               | Linux | 55fps                       | 65fps
Dynarec            | New Angrylion Multithreaded | Linux | 72fps                       | 84fps
Cached interpreter | New Angrylion Multithreaded | Linux | 69fps                       | 82fps

macOS

Too slow to be worth bothering with; singlethreaded Angrylion actually turned out faster here. That is why the Mac version will still be using the old Angrylion version.

Videos

Conker’s Bad Fur Day

Banjo Tooie

Biohazard 2/Resident Evil 2

Killer Instinct Gold

Super Mario 64

Sources

https://github.com/libretro/parallel-n64

https://github.com/ata4/angrylion-rdp-plus/commits/master

Performance tips

Some core options have the potential to dramatically improve performance.

Quick Menu -> Options -> Framerate – You can set this to either 'Original' or 'Fullspeed'. Original will attempt to run the game at its original framerate, while Fullspeed bumps it up to 60 VI/s. Note – if you find a game is running below fullspeed on your system, consider setting this to 'Original'. I know that in Conker's Bad Fur Day and Pilotwings 64, there is a big performance impact if you set it to 'Fullspeed'.

Quick Menu -> Options -> VI Overlay – Disabling this can give you a 10 to 20fps speedup at the expense of the VI overlay’s filtering being lost, leading to a more pixelated but less blurry image. Also note that some games may not work properly with VI Overlay off right now, such as Resident Evil 2.

How to improve the graphics

In case you find the N64’s native resolution and blurry VI filter to be unpalatable, we want to bring your attention to various things you can do to improve your graphics.

In this video we will be showing you how to apply a so-called ‘Super VI Mode’ filter in order to improve the N64’s graphics.

Note – how these shaders will perform depends entirely on the power of your GPU. The configuration you see later in the video (nnedi-4x) requires a lot more GPU power than the former one (2x). Be mindful of this.

This video will teach you:
* How to load shader presets
* How to stack additional shader chains on top of existing shader presets
* How to configure shader parameters to adjust the screen.

We hope this video will tickle your curiosity so that you will try to hit upon even more fancy shader configurations! The sky is the limit with RetroArch and our common shaders library.

Shader Changes

Abstract

GLSL shaders now preferred over Cg when possible
Update to latest RetroArch for compatibility with updated GLSL shaders

Cg shaders demoted, GLSL promoted to first-class

Portability and compatibility are major goals for RetroArch and libretro, so we invested heavily in Nvidia’s Cg shader language, which worked natively anywhere their Cg Toolkit framework was available (that is, Windows, Linux and Mac OS X), as well as on PS3 and Vita, and could be machine-compiled to messy-but-usable GLSL (lacking a few features, such as runtime parameters) for platforms that lacked the framework (primarily ARM / mobile platforms). Cg was also so close to Microsoft’s HLSL shader language that many Cg shaders will compile successfully with HLSL compilers, such as those available with Windows’ D3D driver and on Xbox 360.

This was great for us because we could write shaders once and have them work pretty much everywhere.
Sadly, Nvidia deprecated the Cg language in 2012, which left us in a bad spot. Since then, we’ve been limping along with the same strategy as before, but with the uneasy understanding that Nvidia could stop supplying their Cg Toolkit framework at any time. Rather than sit idly by, waiting for that other shoe to drop, we took it upon ourselves to hand-convert the vast majority of our Cg shaders to native GLSL with all of the bells and whistles. TroggleMonkey’s monstrous masterpiece, CRT-Royale, still has a couple of bugs but is mostly working, along with its popular BVM-styled variant from user Kurozumi. Additionally, before this conversion, many of our Cg shaders were flaky or completely unusable on libretro-gl cores, such as Beetle-PSX-HW’s OpenGL renderer, but these native GLSL conversions should work reliably and consistently with any core/context except for those that require Vulkan (namely, ParaLLEl-N64’s and Beetle-PSX-HW’s Vulkan renderers).

With the GLSL shaders brought up to speed, we can finally join Nvidia in deprecating Cg, though it will still remain as an option–that is, we’re not *removing* support for Cg shaders or contexts at this point–and we will continue to use it where there is no other choice; namely, Windows’ D3D driver and the Xbox 360, PS3 and Vita ports. Moving forward, our focus for shaders will be on native GLSL and our slang/Vulkan formats, though we will likely still port some to Cg from time to time.

RetroArch now correctly handles #version directives in GLSL shaders; GLSL shader repo updated to match

There have been a number of updates to the GLSL shader language/spec over its long life, and shader authors can use #version directives (that is, a line at the top of the shader that says #version 130 or whatever) to tell compilers which flavor/version of GLSL is required for that shader. However, RetroArch long had a strange behavior whereby it injected a couple of lines at the beginning of all GLSL shader files at compile time, and this broke any shader that attempted to use a #version directive, since the directive must be on the first line of the shader. This meant that our shaders couldn't use #version directives at all, and for this reason all of our shaders lacked #version directives until very recently. These #version-less GLSL shaders are still perfectly compliant GLSL, since a shader without a #version directive is treated as GLSL v1.10, but the necessity of leaving off the #version started to cause some problems as we whipped our GLSL shader library into shape.

The error caused by adding a #version directive under the old behavior.

On AMD and Nvidia GPUs, the compilers would just toss up a warning about the missing directive and still expose whatever GLSL features were available to the GPU, which worked out great. On Intel IGPs, however, the compiler throws an error and then reverts to only exposing the features available in ancient GLSL v1.10 (released way back in 2004). As a stopgap, we gave many shaders fallback codepaths that would still work in these circumstances, but a number of other shaders were either impossible to make compatible, or the compatible result was imperfect.

So, as of this commit (courtesy of aliaspider), RetroArch will no longer reject shaders with explicit #version directives, and we have added those directives to any shaders that require them, at the lowest version that still compiles/functions properly. That is, if a shader doesn't use any features that require greater than #version 110, it will still have no #version specified, and any shader that requires #version 120 but not #version 130 will not have its requirements bumped to the higher version for no reason. This should keep our GLSL shaders as compatible as possible with older hardware, and including the #versions explicitly when needed will also make it easier for other programs/developers to utilize our shaders without any unnecessary guesswork due to behind-the-scenes magic.

This change does require a clean break, insofar as older versions of RetroArch will choke on the new #version directives (that is, they’ll fail to compile with the “#version must occur before any other program statement” error pictured above), so users with Nvidia or AMD GPUs must update their RetroArch installation if they want to use the updated shaders. Users with Intel IGPs will be no worse off if they don’t update, since those shaders were already broken for them, but they’ll probably *want* to update to gain access to the many fancy shaders that now work properly on their machines.

Mobile GPUs using GLES had many of the same issues that Intel IGPs had, with many shaders refusing to work without #version directives, but GLES compatibility adds a further complication: GLES requires its own #version directives, either #version 100 or #version 300 es, which are different from and incompatible with desktop GL's #versions. To get around this, we added a trick in RetroArch to change any #version of 120 or below to #version 100, which is roughly comparable in features to 120, and any #version of 130 or above to #version 300 es whenever a GLES context is used. This should get everything working as effectively and consistently as possible on mobile GPUs, but if anything slipped through the cracks, be sure to file an issue report at the GLSL shader repo.
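
A minimal sketch of that remapping step is shown below. This is a hypothetical helper for illustration, not RetroArch's actual code:

```cpp
#include <regex>
#include <string>

// When a GLES context is active, a desktop "#version 120" (or lower) becomes
// "#version 100", and anything higher becomes "#version 300 es". Shaders without a
// #version line are left alone (they default to GLSL v1.10 behavior).
std::string remap_version_for_gles(const std::string &source)
{
    static const std::regex version_re(R"(^\s*#version\s+(\d+).*)");
    std::smatch match;
    if (!std::regex_search(source, match, version_re))
        return source; // no directive: legacy #version-less shader

    int version = std::stoi(match[1].str());
    const char *replacement = (version <= 120) ? "#version 100" : "#version 300 es";
    return std::regex_replace(source, version_re, replacement,
                              std::regex_constants::format_first_only);
}
```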

RetroArch 1.4.1 Open Beta – Released! Highlights

Half a year after RetroArch 1.3.6 was released, now comes the next big stable release! Version 1.4.1 is by any yardstick a massive advance on the previous version. There are about 5000 commits or more to sift through, so let's focus on a few big standout features that we want to emphasize for this release.

Where to get it

https://buildbot.libretro.com/stable/1.4.1/

We are calling this release an 'Open Beta' because we want people to put the massively improved netplay features through their paces! All of your feedback and issues will be taken on board so that 1.5.0 (which we intend to ship at the beginning of March) will deliver on all the promises we have made for netplay.

Netplay

Netplay has seen a massive improvement since version 1.3.6.

To set up a netplay game, you have two options: Manual or automatic connection.

Naturally, the automatic way is easier:

To host, just load a core and launch some content as usual and, once the game is running, go back into the ‘quick menu’ (the default keyboard shortcut is F1) and scroll down to the ‘netplay’ submenu. From there, choose ‘Start netplay host’ and it will announce your game to the lobby server for other users to join. You can go ahead and start playing and new players can jump in at any time. That is, RetroArch no longer stalls out until clients join.

Joining an existing session is just as easy. From the main menu, navigate over to the netplay tab (the icon looks like a wifi symbol), scroll down to ‘Refresh Room List’ and hit the ‘accept’ key/button (the default keyboard shortcut is the ‘X’ key). RetroArch will fetch the current list of available hosts and display them right there in the netplay tab. From there, just pick the host you wish to join and RetroArch will cycle through your playlists searching for a content match. If it finds a match, you’ll jump right into the host’s game already in progress.

To use manual connection, the host does the exact same steps. The client must load the same core and game first, then choose the “connect to netplay host” option from the netplay menu. You will be prompted for the IP address of the host. Enter it to connect.

To keep your games private, the host may set a password, required to connect, in the network settings menu.

We want your feedback and input on netplay, and the aim is to take it into consideration for 1.5.0 (which we will launch in early March) to put the final finishing touches on netplay in general. Things like chat, friend lists and so on still need to be implemented.

Multi-language support/Japanese language support

We have added UTF-8 support and we have added translations for several languages now. Of these, Japanese is probably second to English in terms of being the most complete translation.

In addition to this, the new onscreen keyboard also has multilingual support, and supports Japanese fully (Hiragana, Katakana).

Free homebrew Bomberman clone game – Mr.Boom

Mr.Boom is a Bomberman clone. It supports up to 8 players and features like pushing bombs, remote controls and kangaroo riding.

This was an old MS-DOS/Windows 9x homebrew game that https://github.com/frranck converted over to C with a self-made tool he calls asm2c.

Right now, this core works on Mac/Windows/Linux. We are still working on Android support!

Mr. Boom currently requires a minimum of 2 players. There is no singleplayer mode (yet). It cannot yet be used with netplay, but that is our ultimate aim! Free, easy 8-player Bomberman-like gameplay for everybody! We will make an announcement later when netplay support is fully working for this core!

New menu graphical effects

In addition to the ribbon effects, we have added some new menu effects : Bokeh, and Snow.

Check the accompanying video to see them in action. You can access these menu effects by going to

Settings -> User Interface -> Menu and setting “Menu Shader Pipeline” to any effect of your choosing.

NOTE: These two new menu effects are not yet available for Vulkan and Cg. Ports would have to be made first of these menu effects, since they are completely shader-based.

Quality-of-Life improvements to the menu

We have taken all the criticisms of the menu UI to heart and we really pushed ourselves to make the menu much more pleasant to deal with.

  • We have gone to the painstaking effort of making sure that nearly every menu entry now has a small description below it.
  • Loading content has been massively streamlined. There is no longer a separate 'Load Content' and 'Load Content (Detect Core)' option. You simply select a starting directory, then select your game, and then decide which core to use.
  • There is a new onscreen keyboard made for the menu which is compatible with touch and the mouse. It not only supports traditional western characters but thanks to improved multilingual support it will also support Japanese (Kanji, Hiragana, Katakana and Romaji).
  • In fullscreen mode, the mouse cursor inside the menu will only show for about 5 seconds. If there is no mouse activity it will disappear from the screen until you move the mouse again.

Improved error handling

Cores should now report an error message back to RetroArch in most instances where a ROM/content fails to load. We went over most cores and we are reasonably confident that we took care of most of the trouble spots.

Vulkan N64 and PSX now work on Android!

To read more about these projects, read our past articles here –

Introducing Vulkan PSX renderer for Beetle/Mednafen PSX

Nintendo 64 Vulkan Low-Level emulator – paraLLel – pre-alpha release

ParaLLel (Nintendo 64 core with Vulkan renderer) and Mednafen PSX HW should now work on Android devices that support Vulkan!

Unfortunately, the GPU is currently not the bottleneck here. In the case of both of these emulators, more work is required before they will start to run at full speed on Android devices. We need to get the LLVM dynarec working on ARM devices.

In the case of Mednafen PSX HW, the interpreter CPU core is the main bottleneck which prevents the emulator from reaching playable speeds right now. An experimental dynarec was written a year ago but it still needs a lot of work before it could be considered ‘usable’.

Lots of other miscellaneous stuff

  • Improved performance
  • (Linux) DRM/KMS context driver should be more compatible now
  • (Linux) The GLX context driver now uses GLX_OML_sync_control when available, leading to much improved swap control. Potential video tearing and frame time deviation will be way lower in general.
  • (Linux) Attaching a PS4 gamepad will allow you to use its headphone jack to route sound to your headphones if you use the ALSA audio driver. RetroArch will now query the audio output sampling rates that an audio device supports, and if the recommended output sampling rate that we use in RetroArch doesn't match, we will use a sampling rate that the audio device DOES support instead. The PS4 pad only works with 32kHz audio, hence why we need to switch to it on the fly in order to get sound working with it.
  • (Android) Should fix a longstanding touch input bug that might have prevented touch from working altogether on certain devices.
  • (Android) GLES3/3.1 support, the fancy ribbon effect and Snow/Bokeh should also be available on Android now.
  • (Linux/Wayland) Full input support, keyboard and mouse.
  • Too much stuff to mention

Also read our companion article for more information here –

RetroArch 1.4.1 Major Changes Detailed!

And even more!

RetroArch 1.4.1 Progress report – DOS/Windows 9x/Windows 2K

Improved documentation

From now on, all documentation for RetroArch (both development and user-facing info) will be posted here –

https://buildbot.libretro.com/docs/