ParaLLEl-RDP – How the upscaled rendering works

This is a technical article on how upscaling in LLE works on the N64 RDP. Accurate upscaling in LLE is something which has not been done before (it has been done in a HLE framework, but accurate is the key word here), due to its extremely intense performance requirements, but with paraLLEl-RDP running on the GPU with Vulkan, this is now practical, and the results are faithful to what N64 games would look like if games rendered at a very high resolution. There are no compromises on accuracy, and I believe this is a correct representation of upscaling in a “what-if” scenario. The changes required to add this were actually fairly minimal, and there aren’t really any hacks involved. However, we have to be somewhat conservative in what we attempt to enhance.

Main concepts

Unified Memory Architecture – fully accurate frame buffer behavior

A complicated problem with the N64 is that the RDP and CPU have a unified memory architecture, and this complicates a lot. We must assume that the CPU can read arbitrary pixels that the RDP rendered, and the CPU can overwrite pixels written by the RDP earlier. In upscaling, this gets weird very quickly since the CPU does not understand upscaling. To support this, the GPU renders everything twice, once in the native domain, and finally in the upscaled domain. With this approach, the CPU cannot observe that upscaling is happening. It also improves performance in synchronous mode, since we can just render native resolution before we unblock CPU, and the GPU can go on to render upscaled render passes asynchronously, which takes a longer time.

Rasterization at sub-pixel precision

The core mathematical problem to solve for upscaling is how we are going to rasterize at sub-pixel precision. This gets somewhat interesting, since the RDP is fully defined in fixed-point, and there is limited precision available. Fortunately, there are enough bits of precision that we can add extra sub-pixel precision to the rasterization equations. 8x is the theoretically maximum upscaling we can achieve without going beyond 32-bit fixed point math. 8x is complete overkill, 2x and 4x are more than enough anyways.

Instancing RDRAM

Given that we have a requirement of unified memory architecture, paraLLEl-RDP directly implements a unified memory architecture (UMA) as mentioned above where the GPU reads and writes directly into RDRAM. This ensures full accuracy, and this is usually where HLE fails, as implementing UMA at this level is not practical with the traditional graphics pipeline in GPUs. To extend paraLLEl-RDP’s approach to upscaling, I went with multiple copies of RDRAM, one copy for each sub-sample. This works really well, because at any time, if we detect that any write happens in an unscaled context, e.g. CPU writes, we can simply duplicate samples up to upscaled domain. This is essentially some kind of faux MSAA where each pixel has multiple samples associated with it. This is the memory we end up allocating for a 4x upscale (4×4 = 16 samples):

  • RDRAM (8 MB) – Allocated on host with VK_EXT_external_memory_host. This is fully coherent with emulated CPU.
  • Hidden RDRAM (4 MB) – Device local
  • RDRAM reference buffer (8 MB) – Device local
  • Multisampled RDRAM (8 * 16 MB) – Device local
  • Multisampled Hidden RDRAM (4 * 16 MB) – Device local

The reference buffer is there so we can track when CPU writes to RDRAM. Essentially, before we render anything on the GPU, we compare RDRAM against the reference buffer. If there is a difference, the CPU must have clobbered the pixel, and the RDRAM is now duplicated to all the samples of RDRAM. After rendering something, we update the reference buffer, so we know it’s safe to use upscaled pixels later.

When rendering an upscaled pixel (X, Y), we convert the coordinate to native pixel (X, Y) and convert the sub-pixel to an RDRAM instance, e.g.:

ivec2 upscaled_pixel = ivec2(x, y);
ivec2 subpixel = upscaled_pixel & (SCALING_FACTOR - 1);
ivec2 native_pixel = upscaled_pixel >> SCALING_LOG2;
int rdram_instance = subpixel.y * SCALING_FACTOR + subpixel.x;
read_write_rdram(native_pixel, rdram_instance);

Upscaled VI interface

Adding upscaling to the VI interface is fairly straight forward since we can convert e.g. 16 samples back to a 4×4 block of pixels. From there, we just follow the exact same algorithms that we do for native rendering. This means we get correct VI AA, divot and de-dither happening at high resolution.

Modifying rasterization rules

The RDP is a span rasterizer, a very classic design. The rasterization rules are extremely specific and cannot be accurately represented using normal OpenGL/Vulkan triangle rasterization rules, which are based on barycentric plane equations (to the best of my knowledge you can only approximate).

The RDP receives pre-computed triangle setup data from the RSP. We specify three lines with the triangle setup, where one line is the “major” line XH, and a second line is picked from the two “minor” lines XM/XL, depending on y >= YM. Two values YH and YL limit which scanlines we should render. This lets us implement triangles, or more complicated primitives if we want to. Bisqwit made a really cool ongoing video series on software rendering a while back which also implements a span rasterizer, which is very useful to watch if you want a deeper understanding of this approach.

This triangle setup data is defined more specifically as:

  • XH, XM, XL: 32-bit values in the format of s12.15.x. The 4 MSB are sign-extended, and the single LSB is ignored (we can exploit this bit for more precision later!)
  • dXHdy, dXMdy, dXLdy: 32-bit values in the format of s12.13.xxx. 4 MSBs are sign-extended, and 3 LSBs are ignored. This represents the slope of the line for XH, XM and XL.
  • YH: This is a s12.2 value which represents the first scanline we render. There is 2 bits of subpixel precision, which is very useful because the RDP will sample coverage for 4 sub-scanlines per scanline.
  • YM: This s12.2 value represents the first sub-scanline where XL is selected as the minor line, otherwise XM is used.
  • YL: This represents the final sub-scanline which is rendered. The sub-scanline of YL is not included in rasterization.

The algorithm for native resolution in GLSL:

// Interpolate X at all 4 Y-subpixels.
// Check Y dimension.
int yh_interpolation_base = int(setup.yh) & ~(SUBPIXELS - 1);
int ym_interpolation_base = int(setup.ym);

int y_sub = int(y * SUBPIXELS);
ivec4 y_subs = y_sub + ivec4(0, 1, 2, 3);

// dxhdy and others are (setup value >> 2) since we're stepping one sub-scanline at a time, not whole lines. This is why more LSBs are ignored for the slopes.
ivec4 xh = setup.xh + (y_subs - yh_interpolation_base) * setup.dxhdy;
ivec4 xm = setup.xm + (y_subs - yh_interpolation_base) * setup.dxmdy;
ivec4 xl = setup.xl + (y_subs - ym_interpolation_base) * setup.dxldy;
xl = mix(xl, xm, lessThan(y_subs, ivec4(setup.ym)));

ivec4 xh_shifted = quantize_x(xh); // A very specific quantizer, see source ...
ivec4 xl_shifted = quantize_x(xl);

ivec4 xleft, xright;
if (flip) // Flip is a bit set in triangle setup to mark primitive winding.
{
    xleft = xh_shifted;
    xright = xl_shifted;
}
else
{
    xleft = xl_shifted;
    xright = xh_shifted;
}

We have now computed a range of which pixels to render for each sub-scanline, where [xleft, xright) is the range. If xright <= xleft, the sub-scanline does not receive coverage. The quantizer is somewhat esoteric, but we essentially quantize X down to 8 sub-pixels of precision (>> 13). This is used later for multi-sampled coverage in the X dimension.

To add upscaling, the modifications are straight forward.

int yh_interpolation_base = int(setup.yh) & ~(SUBPIXELS - 1);
int ym_interpolation_base = int(setup.ym);
yh_interpolation_base *= SCALING_FACTOR;
ym_interpolation_base *= SCALING_FACTOR;

int y_sub = int(y * SUBPIXELS);
ivec4 y_subs = y_sub + ivec4(0, 1, 2, 3);

// Interpolate X at all 4 Y-subpixels.
ivec4 xh = setup.xh * SCALING_FACTOR + (y_subs - yh_interpolation_base) * setup.dxhdy;
ivec4 xm = setup.xm * SCALING_FACTOR + (y_subs - yh_interpolation_base) * setup.dxmdy;
ivec4 xl = setup.xl * SCALING_FACTOR + (y_subs - ym_interpolation_base) * setup.dxldy;
xl = mix(xl, xm, lessThan(y_subs, ivec4(SCALING_FACTOR * setup.ym)));

This is an accurate representation, as the only thing we do here is to shift in more bits into triangle setup, as long as this does not overflow, we’re golden. After this step, we have scissoring. Scissor coordinates are u10.2 fixed point, so it means the maximum resolution for the RDP is 1024×1024. With 8x upscale and 8 sub-pixels of X precision, we can barely pack the resulting range in unsigned 16-bits without overflow.

Modifying varying interpolation

Attribute interpolation is a little more interesting. There are 8 varyings, which all have the same setup data:

  • Shade Red/Green/Blue/Alpha
  • S
  • T
  • 1/W
  • Z

Each varying has 4 values:

  • Base value – sampled at coordinate (XH, YH) (kinda … it’s complicated)
  • dVdx – Change in value for 1 pixel in X dimension
  • dVde – Change in value when following the major axis down one line, and sampling at the next line’s XH. Basically dVde = dVdx * dXdy + dVdy. I’m not sure why this even exists, it makes the interpolation math a little easier I suppose?
  • dVdy – This feels very redundant, but it is what it is. It is only used for coverage fixup and LOD computation.

We cannot shift in extra bits here, unlike rasterization, so we have to be a little creative here. To stay faithful, and avoid overflow, we need to ensure that the interpolation is correct for each sample point which matches sample points for native resolution, and for the inner sub-pixels, we remove some bits of precision in the derivative. Essentially, instead of doing something like this (not the correct math, see code, here for brevity):

int base_interpolated_x = ((setup.xh + (y - base_y) * setup.dxhdy)) >> 16;
rgba = attr.rgba;
int dy = y - base_y;
int dx = x - base_interpolated_x;
rgba += dy * attr.drgba_de;
rgba += dx * attr.drgba_dx;

we do …

int base_interpolated_x = ((setup.xh + (y - base_y) * setup.dxhdy)) >> 16;
rgba = attr.rgba;
int dy = y - base_y;
int dx = x - base_interpolated_x;
rgba += (dy >> SCALING_LOG2) * attr.drgba_de + (dy & (SCALING_FACTOR - 1)) * (attr.drgba_de >> SCALING_LOG2);
rgba += (dx >> SCALING_LOG2) * attr.drgba_dx + (dx & (SCALING_FACTOR - 1)) * (attr.drgba_dx >> SCALING_LOG2);

The added error here is microscopic.

Workarounds

Some games do not work correctly when we upscale, since the game never intended to render sub-pixels. This usually comes into play in two major scenarios, which we need to workaround.

Using LOD for clever hackery

The mip-mapping on N64 is quite flexible, and sometimes two entirely different textures represent LOD 0 and LOD 1 for smooth distance based effects. When upscaling with e.g. 4x, we essentially get a LOD factor which is a LOD bias of -2 (log2(1/4)). An optional workaround is to compensate by applying a positive LOD bias ourselves to emit LOD levels the game expects. Ideally, this workaround is applied only in places where it’s needed.

Sprite rendering / TEX_RECT

Many games render sprites with TEX_RECT with the expectation that textures are rendered 1:1 with input texels to output texels. When we start upscaling, the game might have forgot to disable bilinear filtering, and we start filtering outside the texture boundaries, i.e., against garbage, which shows up as ugly seams in the image. The simple workaround is to render TEX_RECT primitives as if they are not upscaled. This is necessary anyways for the COPY pipe, since the COPY pipe only updates the varying interpolator every 8th framebuffer byte. We cannot safely upscale these kinds of primitives either way.

Conclusion

There isn’t much more to it. Adding upscaling to ParaLLEl-RDP was not all that complicated compared to the other insanity that went into making this renderer work. It’s a principled approach to the upscaling which I believe could theoretically work in a custom RDP hardware design.

paraLLEl N64 – Low-level RDP upscaling is finally here!

ParaLLEl RDP this year has singlehandedly caused a breakthrough in N64 emulation. For the first time, the very CPU-intensive accurate Angrylion renderer was lifted from CPU to GPU thanks to the powerful low-level graphics API Vulkan. This combined with a dynarec-powered RSP plugin has made low-level N64 emulation finally possible for the masses at great speeds on modest hardware configurations.

ParaLLEl RDP Upscaling

Jet Force Gemini running with 2x internal upscale
Jet Force Gemini running with 2x internal upscale

It quickly became apparent after launching ParaLLEl RDP that users have grown accustomed to seeing upscaled N64 graphics over the past 20 years. So something rendering at native resolution, while obviously accurate, bit-exact and all, was seen as unpalatable to them. Many users indicated over the past few weeks that upscaling was desired.

Well, now it’s here. ParaLLEl RDP is the world’s first Low-Level RDP renderer capable of upscaling. The graphics output you get is unlike any HLE renderer you’ve ever seen before for the past twenty years, since unlike them, there is full VI emulation (including dithering, divot filtering, and basic edge anti-aliasing). You can upscale in integer steps of the base resolution. When you set resolution upscaling to 2x, you are multiplying the input resolution by 2x. So 256×224 would become 512×448, 4x would be 1024×896, and 8x would be 2048×1792.

Now, here comes the good stuff with LLE RDP emulation. As said before, unlike so many HLE renderers, ParaLLEl RDP fully emulates the RCP’s VI Interface. As part of this interface’s postprocessing routines, it automatically applies an approximation of 8x MSAA (Multi-Sampled Anti-Aliasing) to the image. This means that even though our internal resolution might be 1024×896, this will then be further smoothed out by this aggressive AA postprocessing step.

Super Mario 64 running on ParaLLEl RDP with 2x internal upscale
Super Mario 64 running on ParaLLEl RDP with 2x internal upscale

This results in even games that run at just 2x native resolution looking significantly better than the same resolution running on an HLE RDP renderer. Look for instance at this Mario 64 screenshot here with the game running at 2x internal upscale (512×448).

How to install and set it up

RDP upscaling is available right now on Windows, Linux, and Android. We make no guarantees as to what kind of performance you can expect across these platforms, this is all contingent on your GPU’s Vulkan drivers and its compute power.

Anyway, here is how you can get it.

  • In RetroArch, go to Online Updater.
  • (If you have paraLLEl N64 already installed) – Select ‘Update Installed Cores’. This will update all the cores that you already installed.
  • (If you don’t have paraLLEl N64 installed already) – go to ‘Core Updater’ (older versions of RA) or ‘Core Downloader’ (newer version of RA), and select ‘Nintendo – Nintendo 64 (paraLLEl N64)’.
  • Now start up a game with this core.
  • Go to the Quick Menu and go to ‘Options’. Scroll down the list until you reach ‘GFX Plugin’. Set this to ‘parallel’. Set ‘RSP plugin’ to ‘parallel’ as well.
  • For the changes to take effect, we now need to restart the core. You can either close the game or quit RetroArch and start the game up again.

In order to upscale, you need to first set the Upscaling factor. By default, it is set to 1x (native resolution). Setting it to 2x/4x/8x then restarting the core makes the upscaling take effect.

Explanation of core options

A few new core option features have been added. We’ll briefly explain what they do and how you can go about using them.

  • (ParaLLEl-RDP) Upscaling factor (Restart)

Available options: 1x, 2x, 4x, 8x

The upscaling factor for the internal resolution. 1x is default and is the native resolution. 2x, 4x, and 8x are all possible. NOTE: It bears noting that 8x requires at least 5GB/6GB VRAM on your GPU. System requirements are steep for 8x and we generally don’t recommend anything less than a 1080 Ti or better for this. Your mileage may vary, just be forewarned. 2x and 4x by comparison are much lighter. Even when upscaling, the rendering is still rendering at full accuracy, and it is still all software rendered on the GPU. 4x upscale means 16x times the work that 1x Angrylion would churn through.

  • (paraLLEl-RDP) Downsampling

Available options: Disabled, 1/2, 1/4, 1/8

Also known as SSAA, this works pretty similar to the SSAA downscaling feature in Beetle PSX HW’s Vulkan renderer. The idea is that you internally upscale at a higher resolution, then set this option from ‘Disabled’ to any of the other values. What happens from there is that this internal higher resolution image is then downscaled to either half its size, one quarter of its size, or one eight of its size. This gives you a very smoothed out anti-aliased picture that for all intents and purposes still outputs at 240p/240i. From there, you can apply some frontend shaders on top to create a very nice and compelling look that still looks better than native resolution but is also still very faithful to it.

So, if you would want 4x resolution upscaling with 4x SSAA, you’d set ‘Downsample’ to ‘1/2’. With 4x upscale, and 1/4 downsample, you get 240p output with 16x SSAA, which looks great with CRT shaders.

  • (paraLLEl-RDP) Use native texture LOD when upscaling

This option is disabled by default.

We have so far only found one game that absolutely required this to be turned on for gameplay purposes. If you don’t have this enabled, the Princess-to-Bowser painting transition in Mario 64 is not there and instead you just see Bowser in the portrait from a far distance. There might be further improvements later to attempt to automatically detect these cases.

Most N64 games didn’t use mipmapping, but the ones that do on average benefit from this setting being off – you get higher quality LOD textures instead of a lower-quality LOD texture eventually making way for a more detailed one as you look closer. However, turning this option on could also be desirable depending on whether you favor accurate looking graphics or a facsimile of how things used to look.

  • (paraLLEl-RDP) Use native resolution for TEX_RECT

This option is on by default.

2D elements such as sprites are usually rendered with TEX_RECT commands, and trying to upscale them inevitably leads to ugly “seams” in the picture. This option forces native resolution rendering for such sprites.

 

Managing expectations

It’s important that people understand what the focus of this renderer is. There is no intent to have yet another enhancement-focused renderer here. This is the closest there has ever been to date of a full software rendered reimplementation of Angrylion on the GPU with additional niceties like upscaling. The renderer guarantees bit-exactness, what you see is what you would get on a real N64, no exceptions.

With a HLE renderer, the scene is rendered using either OpenGL or Vulkan rasterization rules. Here, neither is done – the exact rasterization steps of the RDP are followed instead, there are no API calls to GL to draw triangles here or there. So how is this done? Through compute shaders. It’s been established that you cannot correctly emulate the RDP’s rasterization rules by just simply mapping it to OpenGL. This is why previous attempts like z64gl fell flat after an initial promising start.

So the value proposition here for upscaling with ParaLLEl RDP is quite compelling – you get upscaling with the most accurate renderer this side of Angrylion. It runs well thanks to Vulkan, you can upscale all the way to 8x (which is an insane workload for a GPU done this way). And purists get the added satisfaction of seeing for the first time upscaled N64 graphics using the N64’s entire postprocessing pipeline finally in action courtesy of the VI Interface. You get nice dither filtering that smooths out really well at higher resolutions and can really fake the illusion of higher bit depth. HLE renderers have a lot of trouble with the kind of depth cuing and dithering being applied on the geometry, but ParaLLEl RDP does this effortlessly. This causes the upscaled graphics to look less sterile, whereas with traditional GL/Vulkan rasterization, you’d just see the same repeated textures everywhere with the same basic opacity everywhere. Here, we get dithering and divot filtering creating additional noise to the image leading to an overall richer picture.

So basically, the aim here is actually emulating the RDP and RSP. The focus is not on getting the majority of commercial games to just run and simulating the output they would generate through higher level API calls.

Won’t be done – where HLE wins

Therefore, the following requests will not be pursued at least in the near future:

* Widescreen rendering – Can be done through game patches (ASM patches applied directly to the ROM, or bps/ups patches or something similar). Has to be done on a per-game basis, with HLE there is some way to modify the view frustum and viewport dimensions to do this but it almost never works right due to the way the game occludes geometry and objects based on your view distance, so game patches implementing widescreen and DOF/draw distance enhancements would always be preferable.

So, in short, yes, you can do this with ParaLLEl RDP too, just with per-game specific patches. Don’t expect a core option that you can just toggle on or off.
* Rendering framebuffer effects at higher resolution – not really possible with LLE, don’t see much payoff to it either. Super-sampled framebuffer effects might be possible in theory.
* Texture resolution packs – Again, no. The nature of an LLE renderer is right there in the name, Low-Level. While the RDP is processing streams of data (fed to it by the RSP), there is barely any notion whatsoever of a ‘texture’ – it only sees TMEM uploads and tile descriptors which point to raw bytes. With High Level emulation, you have a higher abstraction level where you can ‘hook’ into the parts where you think a texture upload might be going on so you can replace it on the fly. Anyway, those looking for something like that are really at the wrong address with ParaLLEl RDP anyway. ParaLLEl RDP is about making authentic N64 rendering look as good as possible without resorting to replacing original assets or anything bootleg like that.
* Z-fighting/subpixel precision: In some games, there is some slight Z-fighting in the distance that you might see which HLE renderers typically don’t have. Again, this is because this is accurate RDP emulation. Z-fighting is a thing. The RDP only has 18-bit UNORM of depth precision with 10 bits of fractional precision during interpolation, and compression on top of that to squeeze it down to 14 bits. A HLE emulator can render at 24+ bits depth. Compounding this, because the RSP is Low-level, it’s sending 16-bit fixed point vertex coordinates to the RDP for rendering. A typical HLE renderer and HLE RSP would just determine that we are about to draw some 3D geometry and then just turn it into float values so that there is a higher level of precision when it comes to vertex positioning. If you recall, the PlayStation1’s GTE also did not deal with vertex coordinates in floats but in fixed point. There, we had to go to the effort of doing PGXP in order to convert it to float. I really doubt there is any interest to contemplate this at this point. Best to let sleeping dogs lie.
* Raw performance. HLE uses the hardware rasterization and texture units of the GPU which is far more efficient than software, but of course, it is far less accurate than software rendering.

Where LLE wins

Conversely, there are parts where LLE wins over HLE, and where HLE can’t really go –

* HLE tends to struggle with decals, depth bias doesn’t really emulate the RDP’s depth bias scheme at all since RDP depth bias is a double sided test. Depth bias is also notorious for behaving differently on different GPUs.
* Correct dithering. A HLE renderer still has to work with a fixed function blending pipeline. A software rendered rasterizer like ParaLLEl RDP does not have to work with any pre-existing graphics API setup, it implements its own rasterizer and outputs that to the screen through compute shading. Correct dither means applying dither after blending, among other things, which is not something you can generally do [with HLE]. It generally looks somewhat tacky to do dithering in OpenGL. You need 8-bit input and do blending in 8-bit but dither + quantization at the end, which you can’t do in fixed function blending.
* The entire VI postprocessing pipeline. Again, it bears repeating that not only is the RDP graphics being upscaled, so is the VI filtering. VI filtering got a bad rep on the N64 because this overaggressive quasi-8x MSAA would tend to make the already low-resolution images look even blurrier. But at higher resolutions as you can see here, it can really shine. You need programmable blending to emulate the VI’s coverage, and this just is not practical with OpenGL and/or the current HLE renderers out there. The VI has a quite ingenious filter that distributes the dither noise where it reconstructs more color depth. So not only are we getting post-processing AA courtesy of the RDP, we’re also getting more color depth.
* What You See Is What You Get. This renderer is the hypothetical what-if scenario of how an N64 Pro unit would look like that could pump out insane resolutions while still having the very same hardware. Notice that the RDP/VI implementations in ParaLLEl RDP have NOT been enhanced in any way. The only real change was modifying the rasterizer to test fractional pixel coordinates as well.
* Full accuracy with CPU readbacks. CPU can freely read and write on top of RDP rendered data, and we can easily deal with it without extra hacks.

Known issues

  • The deinterlacing process for interlaced video modes is still rather poor (just like Angrylion), basic bob and weave setup. There are plans to come up with a completely new system.
  • Mario Tennis glitches out a bit with upscaling for some reason, there might be subtle bugs in the implementation that only manifest on that game. This seems to not happen on Nvidia Windows drivers though.

Screenshots

The screenshots below here show ParaLLEl RDP running at its maximum internal input resolution, 8x the original native image. This means that when your game is running at say 256×224, it would be running at 2048×1792. But if your game is running at say 640×480 (some interlaced games actually set the resolution that high, Indiana Jones IIRC), then we’d be looking at 5120×3840. That’s bigger than 4K! Then bear in mind that on top of that you’re going to get the VI’s 8x MSAA on top of that, and you can probably begin to imagine just how demanding this is on your GPU given that it’s trying to run a custom software rasterizer on hardware. Suffice it to say, the demands for 2x and 4x will probably not be too steep, but if you’re thinking of using 8x, you better bring some serious GPU horsepower. You’ll need at least 5-6GB of VRAM for 8x internal resolution for starters.

Anyway, without much further ado, here are some glorious screenshots. GoldenEye 007 now looks dangerously close to the upscaled bullshot images on the back of the boxart!

GoldenEye 007 running with ParaLLEl RDP at 8x internal upscale
GoldenEye 007 running with ParaLLEl RDP at 8x internal upscale
Super Mario 64 running on ParaLLEl RDP with 8x internal upscale
Super Mario 64 running on ParaLLEl RDP with 8x internal upscale
Star Fox 64 running on ParaLLEl RDP with 8x internal upscale
Star Fox 64 running on ParaLLEl RDP with 8x internal upscale
Perfect Dark running on ParaLLEl RDP with 8x internal upscale in high-res mode
Perfect Dark running on ParaLLEl RDP with 8x internal upscale in high-res mode
World Driver Championship running on ParaLLEl RDP with 8x internal upscale
World Driver Championship running on ParaLLEl RDP with 8x internal upscale

Videos

Body Harvest

Perfect Dark

Legend of Zelda: Ocarina of Time

Super Mario 64

Coming to Mupen64Plus Next soon

ParaLLEl RDP will also be making its way into the upcoming new version of Mupen64Plus Next as well. Expect increased compatibility over ParaLLEl N64 (especially on Android) and potentially better performance in many games.

Future blog posts

There might eventually be some future blog post by Themaister going into more technical detail on the inner workings of ParaLLEl RDP. I will also probably release a performance test-focused blog post later testing a variety of different GPUs and how far we can take them as far as upscaling is concerned.

I can already tell you to neuter your expectations with regards to Android/mobile GPUs. I tested ParaLLEl RDP with 2x upscaling on a Samsung Galaxy S10+ and performance was about 36fps, this is with vsync off. With 1x native resolution I manage to get on average 64 to 70fps with the same games. So obviously mobile GPUs still have a lot of catching up to do with their discrete big brothers on the desktop.

At least it will make for a nice GPU benchmark for mobile hardware until we eventually crack fullspeed with 2x native!