RetroArch 1.4.1 Open Beta – Released! Highlights

Half a year after RetroArch 1.3.6 was released, now comes the next big stable! Version 1.4.1 is by any yardstick a big massive advance on the previous version. There are about 5000 commits or more to sift through, so let’s focus on a few big main standout features that we want to emphasize for this release.

Where to get it

We are calling this release an ‘Open Beta’ because we want people to put the massively improved Netplay features through its paces! All of your feedback and issues will be taken onboard so that 1.5.0 (which we intend to ship somewhere beginning of March) will deliver on all the promises we have made for netplay.


Netplay has seen a big massive improvement since version 1.3.6.

To set up a netplay game, you have two options: Manual or automatic connection.

Naturally, the automatic way is easier:

To host, just load a core and launch some content as usual and, once the game is running, go back into the ‘quick menu’ (the default keyboard shortcut is F1) and scroll down to the ‘netplay’ submenu. From there, choose ‘Start netplay host’ and it will announce your game to the lobby server for other users to join. You can go ahead and start playing and new players can jump in at any time. That is, RetroArch no longer stalls out until clients join.

Joining an existing session is just as easy. From the main menu, navigate over to the netplay tab (the icon looks like a wifi symbol), scroll down to ‘Refresh Room List’ and hit the ‘accept’ key/button (the default keyboard shortcut is the ‘X’ key). RetroArch will fetch the current list of available hosts and display them right there in the netplay tab. From there, just pick the host you wish to join and RetroArch will cycle through your playlists searching for a content match. If it finds a match, you’ll jump right into the host’s game already in progress.

To use manual connection, the host does the exact same steps. The client must load the same core and game first, then choose the “connect to netplay host” option from the netplay menu. You will be prompted for the IP address of the host. Enter it to connect.

To keep your games private, the host may set a password, required to connect, in the network settings menu.

We want your feedback and input on netplay, and the aim is that we take your feedback into consideration for 1.5.0 (which we will launch early March) to put the final finishing touches on netplay in general. Things like chat, friend lists and so on will all need to be implemented still.

Multi-language support/Japanese language support

We have added UTF-8 support and we have added translations for several languages now. Of these, Japanese is probably second to English in terms of being the most complete translation.

In addition to this, the new onscreen keyboard also has multilingual support, and supports Japanese fully (Hiragana, Katakana).

Free homebrew Bomberman clone game – Mr.Boom

Mr.Boom is a Bomberman clone. It supports up to 8 players and features like pushing bombs, remote controls and kangaroo riding.

This was an old MS-DOS/Windows 9x homebrew game that converted over to C with a self-made tool he calls asm2c.

Right now, this core works for Mac/Windows/Linux/ We are still working on Android support!

Mr. Boom currently requires at least a minimum of 2 players. There is no singleplayer mode (yet). It can not yet be used with netplay but that is our ultimate aim! Free 8-player easy Bomberman-like gameplay for everybody! We will make an announcement later when netplay support is fully working for this core!

New menu graphical effects

In addition to the ribbon effects, we have added some new menu effects : Bokeh, and Snow.

Check the accompanying video to see them in action. You can access these menu effects by going to

Settings -> User Interface -> Menu and setting “Menu Shader Pipeline” to any effect of your choosing.

NOTE: These two new menu effects are not yet available for Vulkan and Cg. Ports would have to be made first of these menu effects, since they are completely shader-based.

Quality-of-Life improvements to the menu

We have taken all the criticisms of the menu UI to heart and we really pushed ourselves to make the menu much more pleasant to deal with.

  • We have gone to the painstaking effort of making sure that nearly every menu entry now has a small description below it.
  • Loading content has been massively streamlined. There is no longer a separate ‘Load Content’ and ‘Load Content (Detect Core)’ option. You simply select a starting point directory, you then select your game and you decide which core to use.
  • There is a new onscreen keyboard made for the menu which is compatible with touch and the mouse. It not only supports traditional western characters but thanks to improved multilingual support it will also support Japanese (Kanji, Hiragana, Katakana and Romaji).
  • In fullscreen mode, the mouse cursor inside the menu will only show for about 5 seconds. If there is no mouse activity it will disappear from the screen until you move the mouse again.

Improved error handling

Cores should now report an error message back to RetroArch in most instances where a ROM/content fails to load.  We went over most cores and we are reasonably comfortable in that we took care of most of the trouble spots.

Vulkan N64 and PSX now works on Android!

To read more about these projects, read our past articles here –

Introducing Vulkan PSX renderer for Beetle/Mednafen PSX

Nintendo 64 Vulkan Low-Level emulator – paraLLel – pre-alpha release

ParaLLel (Nintendo 64 core with Vulkan renderer) and Mednafen PSX HW should now work on Android devices that support Vulkan!

Unfortunately, GPU is currently not the bottleneck here. In the case of both of these emulators, more work is required before they will start to run at fullspeed on Android devices. We need to get the LLVM dynarec working on ARM devices.

In the case of Mednafen PSX HW, the interpreter CPU core is the main bottleneck which prevents the emulator from reaching playable speeds right now. An experimental dynarec was written a year ago but it still needs a lot of work before it could be considered ‘usable’.

Lots of other miscellaneous stuff

  • Improved performance
  • (Linux) DRM/KMS context driver should be more compatible now
  • (Linux) The GLX context driver now uses GLX_OML_sync_control when available, leading to much improved swap control. Potential video tearing and frame time deviation will be way lower in general.
  • (Linux) Attaching a PS4 gamepad will allow you to use the audio headphone jack to route sound to your headphones if you use the ALSA audio driver. It will now query the available audio output sampling rates that an audio device supports, and if the recommended output sampling rate that we use in RetroArch doesn’t match, we will use a sampling rate that the audio device DOES support instead. The PS4 pad only works with 32Khz audio, hence why we need to switch to it on the fly in order to get sound working with it.
  • (Android) Should fix a longstanding touch input bug that might have prevented touch from working altogether on certain devices.
  • (Android) GLES3/3.1 support, the fancy ribbon effect and Snow/Bokeh should also be available on Android now.
  • (Linux/Wayland) Full input support, keyboard and mouse.
  • Too much stuff to mention

Also read our companion article for more information here –

RetroArch 1.4.1 Major Changes Detailed!

And even more!

RetroArch 1.4.1 Progress report – MS-DOS/Windows 9x/Windows 2K

Improved documentation

From now on, all documentation for RetroArch (both development and user-facing info) will be posted here –

RetroArch 1.4.1 Progress report – DOS/Windows 9x/Windows 2K

RetroArch has now been ported to Windows 98SE/2000 as well as DOS. These are very early work-in-progress ports but in their current state do allow you to start up RetroArch and load a core/game.


For Windows, the current releases and nightly builds do not support XP or below due to changes in the msys2/mingw toolchain. While older Windows versions are indeed supported by the RetroArch codebase, they need to be manually compiled with Visual Studio (Express or Pro) to run properly. For XP and above, Visual Studio 2010 is supported. The solution/project file is located in the pkg/msvc folder of the source along with older msvc solutions. For Windows 98/2000, we support Visual Studio 2005. A DirectX 9.0c SDK is also required, and in order to target 98, a version no newer than December 2006 must be used.

The Windows 98/2000 port may work with our existing OpenGL driver if your graphics card supports a high enough version of OpenGL, but this has not yet been tested. So far 98/2000 has only been tested against a new experimental GDI video driver which does not require hardware acceleration like OpenGL or DirectX (the GDI driver works on newer Windows versions as well). With the GDI driver, the RGUI menu is fully supported and there is also preliminary support for XMB with minimal (text-only) rendering.

For input/joypad and audio support on 98/2000, the DirectX 9.0c runtime should continue to work as it does with newer Windows versions. Windows 98SE requires a DirectX runtime no newer than December 2006, and Windows 2000 can go up to February 2010.

Cores for 98/2000 will also currently need to be compiled manually due to the mingw toolchain used by the buildbot. It’s possible that we may setup a new buildbot target for the older Windows ports at a later date.


The DOS port requires DJGPP to compile (we cross-compile from Linux), and also requires the CWSDPMI server included with that toolchain to access 32-bit protected mode. An experimental “Mode 13h” VGA driver is implemented to provide 320×200 video with 256 colors. Keyboard input support is currently very minimal, only the A/B/X/Y, Start/Select and arrow keys work. There is also no audio support yet.

Cores for DOS will need to be compiled manually, as well as statically linked with RetroArch itself, similar to how our console ports work. This means that a compiled RetroArch EXE file will correspond to just one specific built-in core. FCEUmm and Snes9x2010 are known to work, but due to the default timer tick in DOS being 18hz, gameplay is currently very slow. Work is ongoing to reprogram the interrupt timer which should allow full speed gameplay.


No release yet! You have to compile from source,  and things are still very much a Work-In-Progress!

Compilation instructions

Compilation instructions will be added at a certain point on our Documentation site.

RetroArch 1.4.1 Major Changes Detailed!

It’s been such a long time since we last released a stable (1.3.6), so let’s give some indication of just how much work we’ve poured into the entire ecosystem surrounding RetroArch and libretro since!


RobLoach added scanning GoodNES and GoodN64 sets (i.e., instead of just No-Intro sets) and support for playlists using the ScummVM, NXEngine and Lutro cores. He also added a database of SNES translations, an oft-requested feature, and Super Mario World romhacks, so many of those hacks should show up in scans now.

bparker did a significant overhaul of the content-scanning backend to allow for recursive scanning–that is, the ability to scan inside subdirectories without going into each individual subdir manually. This was one of the most frequently requested scanning improvements. bparker also modified the scanning function to peek at the file extensions supported by the scan before comparing against a database, which speeds up the process substantially. In brief and nonscientific testing, scanning a full NES No-Intro set of 2,703 ROMs went from 4 minutes and 22 seconds down to just 53 seconds, representing a 491% speedup, while scanning an SNES No-Intro set of 3,442 ROMs went from 9 minutes and 37 seconds to 1 minute and 35 seconds, for a whopping 607% speedup.

7zip scanning

bparker also upgraded the 7zip support to a full first-class citizen (that is, right up there with standard zips), which means you can now scan/load/whatever content that has been compressed into a 7zip archive.


Radius greatly improved and expanded the config override functionality, adding the ability to save overrides directly from the RetroArch menu, rather than having to create them manually with a text editor. He also added the ability to save per-core and per-game shader presets, which is the main thing users wanted to change from core to core (along with retropad mapping, which is already handled via the per-core/per-game remapping function). With all of this added functionality, we decided to completely remove the conflicting and often broken and unpredictable “per-core configs” option, which had already been deprecated but kept around for legacy/transition support.


Meanwhile, GregorR has done the impressive and unenviable task of dusting off and overhauling RetroArch’s lag-hiding, peer-to-peer netplay implementation, which had been some of the least-touched, least-understood code in the entire codebase. He has already made great strides on the stability of connections, with a big reduction in (if not outright elimination of) out-of-the-blue desyncs, along with graceful recovery of synchronization following temporary losses of connectivity. Switching from UDP to TCP communication has made it so that only the host needs ports forwarded, which should help with playing games with less-technical friends, and the ability to search for hosts on the same LAN makes it easy to do Japanese arcade-style head-to-head matchups. GregorR also added support for 3+ player netplay, so you can throw down on party classics like Super Bomberman 5 with four of your closest friends. Pursuant to our Patreon goals, we’ll be starting on a netplay matchmaking server solution as one of our top priorities to take advantage of these exciting improvements.

QoL menu improvements for Lakka/RetroArch

Kivutar et al have recently pushed out a pair of major Lakka releases, which include the built-in RetroArch improvements, along with a number of Lakka-specific features/improvements. Users now have the ability to hide advanced settings in the XMB menu, which leads to a greatly simplified default menu in Lakka releases, as well as the ability to hide menu tabs such that only playlists are visible, which is ideal for appliance-style “kiosk” settings, where you don’t want children or other users to monkey around with your settings (these options are also available in non-Lakka RetroArch releases). Kivutar also added the ability to scan for and join wireless networks directly from the Lakka interface, as well as individual history tabs for RetroArch’s built-in multimedia cores, which include a video player, image viewer and audio visualizer (these cores have been built-in for a while but many people didn’t know about them; the history tabs should make them more accessible). You can read more about the Lakka releases here and here.

Fancy new menu graphics features

The XMB menu received some fancy new background shader effects, including an improved ribbon, a nice bokeh effect and a festive and nostalgic snow effect. We also rearranged and consolidated some menus to reduce clutter and hopefully make things easier and more intuitive for new users, and we merged ‘load content’ and ‘load content and detect core’ into a single unified function. We’ve also added secondary, explanatory sub-text to many menu items that should make them less mysterious, and a battery meter that should let mobile users avoid running out of juice unexpectedly.


Sony – PlayStation3/Vita

The Sony console ports have gotten some much-needed love, with frangarcj doing some really stellar work with the Vita henkaku port, which supports the fancy XMB menu, shaders, dynamic core loading, (mostly) full-speed PSX emulation via dynarecced PCSX-ReARMed and more, while Ezi0 has made great progress in getting the PS3 port back to usability.

Nintendo – Wii/WiiU/3DS

On Nintendo consoles, netux79 has been keeping up with the Wii port, adding support for USB gamepads and fixing some savestate issues with the snes9x-next core, while Twinaphex tracked down and fixed a black-screen bug that’s plagued the nightly builds since just after the 1.3.6 release. Aliaspider and Twinaphex have made a lot of improvements to the 3DS port, including fixing a persistent screen-tearing issue, and aliaspider has also made great progress with the nascent Wii U port.


On the core front, Twinaphex has greatly improved the error handling on many cores so that they are less prone to bringing the entire program down (i.e., segfault) when they choke. While this is certainly not as sexy or user-facing as many of the other improvements, it will lead to a more stable, user-friendly experience with fewer mysterious crashes that give no information as to their cause.

Twinaphex also ported Ryphecha’s unbelievably-awesome new Sega Saturn emulator to libretro (listed as “mednafen-saturn” in the online updater). Saturn has long been considered one of the most difficult and esoteric consoles to emulate, and Ryphecha deserves a virtual high-five for doing such a great job with it and generously releasing it under the GPL license. While it is currently only available on x86_64 platforms (i.e., no ARM, no 32-bit x86), its emulation quality and accuracy is already top-notch.

N64 emulation has had some exciting development, as well, with loganmc10‘s Glupen64-libretro port, which combines a shallow fork of mainline Mupen64plus (i.e., not the heavily modified fork that mupen64plus-libretro is based on) with gonetz’s crowdfunded GLideN64 RDP plugin. This core handles many hard-to-emulate effects as compared with the other HLE plugins and gets good performance on even modest hardware. As such, it has become the default core for N64 in the RPi3 spin of Lakka and also performs quite well in Android (and more stable there than regular mupen64plus-libretro, according to user reports). Meanwhile, Seru-kun added support for 64DD disks to Mupen64Plus-libretro, making it the second N64 emulator (after PJ64) to support the Japan-exclusive add-on.

Leiradel and meleu did a massive cleanup on the Retro Achievements/cheevos system, whereby all systems that RetroArch supports have at least one core that includes achievement support, and all systems supported by Retro Achievements is represented. Moreover, spurious achievements should no longer be awarded right when the game starts (a common problem, previously) and achievements will no longer be awarded in the event of a failure to actually meet the requirement(s).

Happy New Year!

We at Libretro wish you all a happy New Year! 2016 has been quite the year for Libretro as a project, so let’s briefly recap where we stand at the end of this year and what we managed to do in 2016 –

First with Vulkan

We were one of the first programs to ride the Vulkan wave, and we managed to add Vulkan support to RetroArch since Day One of the new graphics API’s release.

Continue reading “Happy New Year!”

Introducing Vulkan PSX renderer for Beetle/Mednafen PSX

I needed a break from paraLLEl RDP, and I wanted to give PSX a shot to have an excuse to write a higher level Vulkan renderer backend. The renderer backends in Beetle PSX are quite well abstracted away, so plugging in my own renderer was a trivial task.

The original PlayStation is certainly a massively simpler architecture than N64, especially in the graphics department. After one evening of studying the Rustation renderer by simias and PSX GPU docs, I had a decent idea of how it worked.

Many hardware features of the N64 are non-existent:

  • Perspective correctness (no W from GTE)
  • Texture filtering
  • Sub-pixel precision on vertices (wobbly polygons, wee)
  • Mipmapping
  • No programmable texture cache
  • Depth buffering
  • Complex combiners

My goal was to create a very accurate HW renderer which supports internal upscaling. Making anything at native res for PSX is a waste of time as software renderers are basically perfected at this point in Mednafen and more than fast enough due to the simplicity.

Another goal was to improve my experience with 2D heavy games like the Square RPGs which heavily mix 2D elements with 3D. I always had issues with upscaling plugins back in the day as I always had to accept blocky and ugly 2D in order to get crisp 3D. Simply sampling all textures with bilinear is one approach, but it falls completely flat on PSX. Content was not designed with this in mind at all, and you’ll quickly find that tons of artifacts are created when the bilinear filtering tries to filter outside its designated blocks in VRAM.

The final goal is to do all of this without ugly hacks, game specific workarounds or otherwise shitty code. It was excusable in a time where graphics APIs could not cleanly express what emulation authors wanted to express, but now we can. Development of this renderer was a fairly smooth ride, mostly done in spare time over ~2 months.


This renderer would not exist without the excellent Mednafen emulator and Rustation GL renderer.

Tested hardware/drivers

  • nVidia Linux/Windows 375.xx+ (works fully)
  • AMDGPU-PRO 16.30 Linux (works fully)
  • Mesa Intel (Ivy Bridge half-way working, Broadwell+, fully working, you’ll want to build from Git to get some important bug fixes which were uncovered by this renderer :D)
  • Mesa Radeon RADV (fully working, you’ll want to build from Git to get support for input attachments)

– But, but, I don’t have a Vulkan-capable GPU

Well, read on anyways, some of this work will benefit the GL renderer as well.

– But, but, you’re stupid, you should do this in GL 1.1 and hack it until it works

No 🙂

– Fine, but clearly this is just for shits and giggles

Doing it for the lulz is always a valid reason.


The source will be merged upstream to Github immediately.

PSX GPU overview

The PSX GPU is a very simple and dumb triangle rasterizer with some tricks.


The PSX has a 1024×512 VRAM at 16bpp, giving us 1MB of VRAM to work with. Interestingly enough, this VRAM is actually organized as a 2D grid, and not a flat array with width/height/stride. This certainly simplifies things a lot as we can now represent the VRAM as a texture instead of shuffling data in and out of SSBOs.

Unlike N64, the CPU doesn’t have direct access to this VRAM (phew), so access is mediated by various commands.


The PSX can sample textures at 4-bit palettes, 8-bit palettes or straight ABGR1555, very neat and simple. Texture coordinates are confined to a texture window, which is basically an elaborate way to implement texture repeats. Textures are sampled directly from VRAM, but there is a small texture cache. For purposes of emulation, this cache is ignored (except for one particular case which we’ll get to …).

An annoying feature is that the color “0x0000” on PSX is always transparent, so all fragment shaders which sample textures might have to discard, another reason to be careful with bilinear.

Shading options

PSX just has 3 shading options, which makes our life very simple:

  • Interpolate color from vertices
  • Interpolate UV and sample nearest neighbor
  • Sample texture multiplied by interpolated color (gouraud shading)

It is practical to not use uber-shading approaches here.


PSX has a weird way of dealing with transparency. There is no real alpha channel to speak of, we only have one bit, so what PSX does is set a constant transparency formula, (A + B, 0.5A + 0.5B, B – A, or 0.25A + B). If the high-bit of a texture color is set, transparency is enabled, if not, the fragment is considered opaque. Semi-transparent color-only primitives are simply always transparent.


Possibly the most difficult feature of the PSX GPU is the mask-bit. The alpha bit in VRAM is considered a “read-only” bit if mask bit testing is enabled and the read-only bit is set. This affects rendering primitives as well as copies from CPU and VRAM-to-VRAM blits.

Especially mask-bit emulation + semi-transparency creates a really difficult blending scenario which I haven’t found a way to do correctly with fixed function (but that won’t stop us in Vulkan). Correctly emulating mask-bit lets us render Silent Hill correctly. The trees have transparent quads around them without it.


Intersecting VRAM blits

It is possible, and apparently, well defined on PSX to blit from one part of VRAM to another part where the rects intersect. Reading the Mednafen/Beetle software implementation, we need to kind of emulate the texture cache. Fortunately, this was very doable with compute shaders, although not very efficient.

Implementation details

Feature – Adaptive smoothing

As mentioned, I prefer smooth 2D with crisp-looking 3D. I devised a scheme to do this in post.

The basic idea is to look at our 4x or 8x scaled image, we then mip-map that down to 1x with a box filter. While mip-mapping, we analyze the variance within the 4×4 or 8×8 block and stick that in alpha. The assumption here is that if we have nearest-neighbor scaled 2D elements, they typically have a 1:1 pixel correspondency in native resolution, and hence, the variance within the block will be 0. With 3D elements, there will be some kind of variance, either by values which were shaded slightly differently, or more dramatically, a geometry edge. We now compute an R8_UNORM “bias-mask” texture at 1x scale, which is 0.0 where we estimate we have 3D elements, and 1.0 where we estimate we have 2D. To avoid sharp transitions in LOD, the bias-mask is then blurred slightly with a 3×3 gaussian kernel (might be a better non-linear filter here for all I know).

On final scanout we simply sample the bias-mask, multiply that by log2(scale) and use that as an explicit lod in textureLod() with trilinear sampling, and magically 2D elements look smooth without compromising the 3D sharpness. Sure, it’s not perfect, but I’m quite happy with the result.

Consider this scene from FF IX. While some will prefer this look (it’s toggleable), I’m not a big fan of blocky nearest-neighbor backgrounds together with high-res models.


With adaptive smoothing, we can smooth out the background and speech bubble back to native resolution where they belong. You may notice that the shadow under Vivi is sharp, because the shadow which modulates the background is not 1:1. This is the downside of doing it in post certainly, but it’s hard to notice unless you’re really looking.


The bias mask texture looks like this after the blur:


Potential further ideas here would be to use the bias-mask as a lerp between xBR-style upscalers if we wanted to actually make the GPU not fall asleep.

There is nothing inherently Vulkan specific about this method, so it will possibly arrive in the GL backend at some point as well. It can probably be used with N64 as well.

Obviously, for 24-bpp display modes (used for FMVs), the output is always in native resolution.

GPU dump player

Just like the N64 RDP, having an offline dump player for debugging, playback and analysis is invaluable, so the first thing I did was to create a basic dump format which captures PSX GPU commands and plays them back. This is also nice for benchmarking as any half-capable GPU will be bottlenecked on CPU.

PGXP support

Supporting PGXP for sub-pixel precision and perspective correctness was trivial as all the work happens outside the renderer abstraction to begin with. I just had to pass down W to the vertex shader.

Mask bit emulation

Mask bit emulation without transparency is quite trivial. When rendering, we just use fixed function blending, src = INV_DST_ALPHA, dst = DST_ALPHA.

With semi-transparency things get weird. To solve this, I made use of Vulkan’s subpass self-dependency feature which allows us to read the pixel of the framebuffer which enables programmable blending. Now, mask-bit emulation becomes trivial. This feature is a standard way of doing the equivalent of GL_ARB_texture_barrier, GL_EXT_framebuffer_fetch and all the million extensions which implement the same thing in GL/GLES. For mask-bit in copies and blits, this is done in compute, so implementing mask bit here is trivial.


Copies in VRAM are all implemented in compute. The main reason for this is mask bit emulation becomes trivial, plus that we can now overlap GPU execution of rendering and blits if they don’t intersect each other in VRAM. It is also much easier to batch up these blits with compute, whereas doing it in fragment adds some restrictions as we would need to potentially create many tiny render passes to blit small regions one by one, and we need blending to implement masked blits, which places some restrictions on which formats we can use.

When blitting blocks which came from rendered output, the implementation blits the high-res data instead. This improves visual quality in many cases.

Being careful here made the FF8 battle swirl work for me, finally. I’ve never seen that work properly in HW plugins before 🙂 PGXP is enabled here with perspective correctness as well.

Intersecting VRAM blits

For intersected VRAM blits, I dispatch one 128-thread compute group which basically implements the C++ variant as-is. It emulates the texture cache by reading in data from VRAM into registers, barrier(), then writing out. This then loops through the blit region. It’s fairly rare, but this case does trigger in surprising places, so I figured I better do it as accurate as I could.

The Framebuffer Atlas – Hazard tracking

The entire VRAM is one shared texture where we do all our rendering, scanout, blits, texture sampling and so on. I needed a system where I could track hazards like sampling from a VRAM region that has been rendered to, and deal with changing resolutions where crazy scenarios like CPU blitting raw pixels over a framebuffer region which was rendered in high resolution. Vulkan allows us to go crazy with simultaneous use of textures (VK_IMAGE_LAYOUT_GENERAL is a must here), as long as we deal with sync ourselves.

In order to support high-res rendering and sampling from textures, I needed to deal with two variants of VRAM:

  • One RGBA8_UNORM texture at 1024x512xSCALE, rationale for RGBA8 is mandated support for image load/store, it has alpha (for mask bit) and good enough bit-depth. This texture also has log2(scale) + 1 mip-levels, so we can do the mip-mapping step of adaptive smoothing in place.
  • One R32_UINT texture at 1024×512. I actually wanted R16_UINT, but this is not mandated for image load/store. Here we store the “raw” bit pattern of VRAM, which makes paletted texture reads way cheaper than having to pack/unpack from UNORM.

I split the VRAM into 8×8 blocks. All hazards and dependencies are tracked at this level. I chose 8×8 simply because it fits neatly into 64 threads on blits (wavefront size on AMD), and the smallest texture window is 8 pixels large, nice little coincidence 🙂

Each block has 2 bits allocated to it to track domain ownership:

  • Block is only valid in UNSCALED domain.
  • Block is valid in both UNSCALED and SCALED, but prefer SCALED.
  • Block is valid in both UNSCALED and SCALED, but prefer UNSCALED.
  • Block only valid in SCALED domain.

Whenever I need to read or write VRAM in a particular domain, I need to check the atlas to see if the domain is out-of-sync. If it is, I will inject compute workgroups which “resolve” one domain to the other.

If the access is a “write” access, the block will be set to “UNSCALED only” or “SCALED only” so that if anyone tries to access the block in a different domain, they will have to resolve the block first.

To resolve SCALED to UNSCALED, a simple box-filter is used. In effect, at 4x scale we get 16x supersampling, or 64x SSAA at 8x scale 😀 To resolve UNSCALED to SCALED, nearest neighbor is used. The rationale for doing it this way is that resolving up and down in scale is a stable process. Using nearest neighbor for up-resolves also works excellent with adaptive smoothing since we will get a smoothed version of the block which was resolved from UNSCALED and wasn’t overwritten by SCALED later.

Another cool thing is that I use R32_UINT in the UNSCALED domain, so I actually pack in 10-bit color on resolve. Regular ABGR1555 goes in the lower 16-bits, but the upper 16 bits are used to hide “hidden” precision bits, which are used if the texture is ever read as straight ABGR1555. This greatly improves image quality on any framebuffer effects without having to resolve with dithering and keeps palette reads efficient.

One very interesting bug I had at some point was that the Silent Hill intro screen wouldn’t fade out, it turns out that it samples the previous frame with a feedback factor of ~0.998! We have to be very careful here with rounding, the problem I had was that at slightly darker tones round(unorm_color * 0.998) == unorm_color, so just a smear instead of fade out, especially since the main framebuffer was just 5-bit per color … The fix here was to try to mimic rounding modes closer to what PSX does on 8-bit -> 5-bit, simply chopping away LSBs, now it looked correct. Using 10-bpp resolves improved things a bit more.

Now, even though I can sync between domains, I still need to track write-after-write, read-after-write and write-after-read hazards within a domain. Whenever a GPU command reads or writes from an 8×8 block, I check to see if there are any pipeline stages which have also accessed the pipeline stage in a way which is a hazard. E.g. a write will need a pipeline barrier if there are any readers or writes, but a reader only needs to check if there are writers. If such a hazard is detected, a callback is signalled with which pipeline stages need to participate in a vkCmdPipelineBarrier and which caches need to be flushed/invalidated on the GPU. The bits are then cleared. Overall, I make due with 16 bits per block, which is very compact.

While this scheme is fine for blits and copies and whatnot, renderpasses are handled a bit special, instead of checking the atlas for every primitive, the bounding box of the render pass itself is considered instead. Only when the bounding box of the renderpass increases does it damage the atlas and resolve any hazards which may arise.

Render passes are always done in-order, so hazards between render passes are ignored in the atlas for simplicity. Overall, the performance of this approach turned out to be great, and seems to be very accurate for the content I’ve tested.

The atlas implementation is API agnostic, so hopefully this should fix some bugs in the GL renderer as well if integrated.

Render pass batching

PSX renders one primitive at a time, so it is quite obvious that we need to aggressively batch primitives. There is another side here which is important to consider, for tiled-based renderers on mobile, each render pass has a very significant cost in that beginning/ending the render pass needs to read-in/write-out all memory associated in the render area, which is a quite large drain on performance. PSX games tend to make it difficult for us as clear rects come in, scissor boxes change and we need to batch as aggressively as we can here. Not all games use clear rects, so using loadOp = CLEAR isn’t always an option.

The approach I took is very similar to the Rustation renderer, but with some extra considerations for Vulkan render passes.

As a primitive comes in, it is placed in one or more queues:

  • Opaque non-textured primitives
  • Opaque textured primitives
  • Semi-transparent textured primitives
  • Semi-transparent primitives (including textured) and primitives which use mask bit

The screen space bounding box is computed for this primitive. Using this information we can figure out if the primitive is “scissor invariant”, i.e. the scissor box cannot clip any pixel in the primitive. If this is the case, we can say the primitive belongs to scissor instance -1. If the primitive can be clipped, we assign the primitive a scissor index. The scissor index increases whenever set_draw_area() changes the scissor box. From here, we enter the atlas to see if the union of the scissored bounding box and existing render pass area increases, if it does, we need to check for hazards to avoid any synchronization issues. If this happens, we flush out the render pass, synchronize and start a new render pass.

For clear rects, we similarly expand the render area as needed. If the current render area == clear rect area, this becomes a clear op candidate. If the render pass is later flushed with this particular area, we know we can use loadOp = CLEAR and save lots of readbacks on tiled GPUs, yay.

When the render pass is flushed, we render out our queues in a particular order. While the PSX GPU does not have depth buffers, it doesn’t mean we cannot use depth buffering ourselves to sort primitives in a more favorable order.

Opaque primitives are sorted by scissor index, and then front-to-back. These are rendered first. Then, we consider the semi-transparent textured primitives, these are conditionally semi-transparent. Just like Rustation, we render the primitives as if they were opaque, and discard the fragments if they are indeed opaque. If they are opaque, we end up writing the primitives Z to the depth buffer, serving as a mask (depth test = LESS) when we later redraw these primitives again a second time.

Now that we have sorted out the opaque pixels, we render the semi-transparent primitives in-order, batching up as many primitives as we can depending on the VkPipeline they need to use. While some crazy reordering can be done here if primitives don’t overlap, I doubt it’s worth it.

Primitives which use mask bit and semi-transparency are always drawn alone, because we need to perform a by-region vkCmdPipelineBarrier(COLOR_ATTACHMENT -> INPUT_ATTACHMENT) to safely read the framebuffer. This is quite expensive on IMR GPUs, but performance is just fine in the prime example of this PSX feature, which is Silent Hill. On tile based GPUs, this is basically free though, so that will be interesting to test in the future. 🙂

An important case where having a tight bounding box on our draw area is the MGS codec, generated from a frame trace in rsx-player:

The “bloom” effect is done by rendering the codec text to the lower left, then blend it with offsets on top, effectively creating a gauss kernel (!?) If we used the draw rect naively as the bounding box, we would create 13-15 render passes just to draw this thing as the hazard tracker would think that we rendered to a framebuffer while also trying to sample from it at the same time, one render pass for each blend step.

Line rendering

Line rendering is always a PITA. PSX has a very particular rasterization pattern which games sometimes rely on to draw primitives correctly, you may have noticed the one pixel that was wrong in the video above … ye, it’s using lines, go figure. The current implementation generates a quad which tries its best to approximate wide lines to match the rasterization pattern of PSX, but it’s not quite there yet.

Vulkan higher level API

This time around I wanted an excuse to create a higher level Vulkan API. The Vulkan backend in the RDP is a bit too explicit in hindsight and adding things like VI filtering would require a ton of boilerplate crap to deal with render passes etc … so, this time around I wanted to do it better.

I’m quite happy with the API as a standalone renderer API and I hope to reuse this in the RDP and other side projects when I get back to that.

Reusable PSX renderer implementation

The renderer exposes a C++ API which closely matches the PSX GPU. It should be fairly straight forward to reuse in any other PSX emulator or maybe even used as a renderer for a retro-themed game which tries to mimic the look and feel of early 3D games.


As you can expect, performance is good. The better desktop GPUs easily render this stuff at 8K resolution if you’re crazy enough to try that. In more modest resolutions, 1000++ FPS is easy, you’re going to be CPU bound in the emulator anyways, might as well crank it as high as it’ll go. The atlas hazard tracking doesn’t seem to appear in my profiles, so I guess it’s fast enough.

Interestingly enough, I was worried that VK_IMAGE_LAYOUT_GENERAL would decimate performance on AMD, but it seems just fine, guess I’m not bandwidth bound. 🙂

Enabling this renderer in Beetle

Make sure you enable the Vulkan backend in RetroArch. Beetle will now try multiple backends until one of them succeeds, the final fallback is software.


While I haven’t tested every game there is, I think it’s quite solid already, in far better shape than paraLLEl RDP is at least. The bugs I know of so far are all minor visual glitches which are likely due to either upscaling or slightly off rasterization rules.

Source code repository

The source code repository to Mednafen/Beetle PSX can be found here –


We are now on Patreon!


The Libretro Project (comprised of Libretro, Lakka, and RetroArch) is now on Patreon! We hope this Patreon will enable us to accelerate development and be able to serve users in lots of benevolent ways!

Visit us here:

This Patreon covers the Libretro, Lakka, RetroArch projects. And another, soon to be disclosed project as well.

Right now we are at $230 as of this minute. We thank every Patron so far that has helped us get to this stage in such short time, suffice to say you won’t be let down! Let’s go over some of the goals as they stand!

$150 – Bounty for core work every month! Reached!

Already the $150 goal has been reached which will allow us to place bounties for core work to be done! We allocate a total of $50 / month that will go towards bounties.

$200 – ProjectFuture Greenlight! Reached!

I will be revealing soon what this project is about. Let’s just state it’s going to be an even bigger and more expansive project than RetroArch has been so far, and it’s one of the main reasons why we finally went ‘why not?’ with regards to the Patreon. Stay tuned!

This is going to take months and months of work, and will take other considerable resources in order for us to be able to see it to completion, and it’s definitely one of those ‘flying very close to the sun’-type endeavors, but as with everything with this project, ‘dreaming big’ and ‘foolhardy’ are comfortable bedfellows.

$400 – Netplay/matchmaking server!

We want RetroArch users to be able to play online multiplayer games with each other through the RetroArch interface. We are going to allow for PSN/XBLA-like features, except free of charge! The prospect of true crossplatform free netplay from an easy and console UI-style interface is soon to be within reach once we hit this target!

The aim is that every user will be able to quickly and easily setup a netplay game from within RetroArch without the need of a keyboard/mouse! We want console-style netplay ease of use !

$500 – Stability checks, Quality Assurance, etc!.

It’s no secret that for years we have relied on volunteer work in order to get where we are. This entire project entails a maddening amount of work that we have to put in on a daily basis to keep the entire show up and running, and the amount of work keeps growing every time we add another platform port or add a new core.

Once we hit our $500 target, we are going to be paying a couple of developers whom have been loyal towards the project to keep tabs and checkups on RetroArch and various libretro-related cores on a bi-monthly basis. This way, bugs and regressions are easily spotted and we can instantly fix them.

$600 – Development bounties!

We are going to be posting bounties for various remaining issues (whether it be RetroArch or cores), and any developer will be able to fix these issues and claim the reward!

Finally we can start claiming bounties for some of the things that RetroArch and Lakka might still be missing! Good developers don’t grow on trees, neither do contributors. We hope that through these bounties we will be able to significantly improve the software and get to our goals much quicker!

NOTE: The amount of money that will be allocated for this is variable and decided at our own discretion.

A RetroArch retrospective and what to look forward to

I have been following the events on a few libretro related threads in reddit and I find it quite disappointing to see the amount of hatred directed to a project that has done nothing but do what end-users wanted for more than three years now. I also find it terrible (but interesting none the less) that the social media post is more active than the actual highlight.

Disclaimer: this represents my own experiences and my points of view with regard of the situations that surround our project.


A bit of my personal history with the project:

Let’s look back all the way to 2013. RetroArch was still called SSNES, a fairly small commandline program with just a few cores, a launcher that could be used to adjust options and that’s it. No bells or whistles, just a few nice cores implemented under one frontend with a common feature set. I hadn’t really been using emulators since the zSnes days other than a few tries with mobile emulators on my WinCE device.

I just had built a game-room / tv-room. So I setup XBMC and loved it. Soon I started looking for emulators that would work nicely with my setup. I installed Nestopia and some XBMC plugin that acted as a launcher with worked mostly fine. I liked the emulation but I also like the fact that I could set hotkeys to save, load, and it presented nice OSD messages on non-game actions and I could drive the whole thing with my gamepad only. I hoped other emulators would have the same features but I was let down almost instantly. Regardless I pursued my objective with a miriad of tools (Pinnacle Game Profiler, Xpadder, Joy2Key, batch files, Daemon Tools to name a few).

Continue reading “A RetroArch retrospective and what to look forward to”

Mednafen/Beetle PSX – PGXP arrives!

Mednafen/Beetle PSX has made another significant stride forward! iCatButler has contributed a working backport of PGXP for Mednafen/Beetle PSX.

PlayStation rasterization issues

Several issues can be noticed in most PlayStation games’ graphics.

Continue reading “Mednafen/Beetle PSX – PGXP arrives!”

RetroArch Web Player

An Emscripten port of RetroArch has existed for years, but until recently, we never had a good opportunity to launch it in a state we felt comfortable with. Well, until now that is.

Web Player

So what is RetroArch Web Player? It’s a port of RetroArch that runs inside your web browser, powered by emscripten and asm.js. Most modern browsers available today should be compatible. That being said, we strongly recommend you use Google Chrome right now for smooth v-synced gameplay with no audio crackling.

You can check it out right here!

Continue reading “RetroArch Web Player”

paraLLEl RDP and RSP updates (September 2016)

Unfortunately, I haven’t had much time to work on paraLLEl lately, but there is plenty to update about.

paraLLEl RSP – Clang/LLVM RSP recompiler experiment

Looking at CPU profiles, paraLLEl RDP could never really shine, as it was being held back by the CXD4 RSP interpreter, so groundbreaking speedups could not be achieved. With paraLLEl RDP, the RSP was consuming well over 50% CPU time. This was known from the beginning before RDP work even started. After the first RDP pre-alpha release, focus shifted to RSP performance, and that’s what I’ve spent most time on. None of my machines are super-clocked modern i7s, which have been required to run N64 LLE at good speed.

Micro-optimizing the interpreter is a waste of time, I needed a dynarec. However, I have never written a dynarec or JITer for that matter before, and I was not going to spend months (years?) learning how to JIT code well for ~4 architectures (x86, x64, ARMv7, ARMv8). Instead, using libclang/libllvm as my codegen proved to be an interesting hack that worked surprisingly well in practice for this project.

Continue reading “paraLLEl RDP and RSP updates (September 2016)”