One day I started thinking, "I should have a good marketing website for all the Patreon projects that people can send to game developers." You know, a sort of elevator pitch you-can-trust-me-with-your-source-code thing.
...then I decided that this was actually a good idea, because I'm stubborn and foolish in equal measures.
I took my silly little 2D engine and started building a little demo thing (this is incomplete, be gentle), which you can see here. Left and right arrow keys move, clicking "TAKE ME THERE" doesn't take you anywhere at all yet. Don't make fun of me, I'm just trying out an idea.
And I thought, some bigwig CEO of GameCorp is going to get an email from a Linux user at lunch that says "you can see some things Ryan did at this link," and he's going to tap that link on his phone, so I needed to make sure it works with iPhones, and, uh...it did, at like 1 frame per second.
I'm thinking, hmm, is it that expensive to decode Ogg Theora in asm.js? So I turned off the movie. Not faster. I turned off the "DEVELOPMENT BUILD" box at the bottom, though, and it got noticeably faster, and aha, I knew what the problem was.
Each letter on that string of text in the blue box is a separate textured quad, rendered through SDL's 2D render API.
I had originally gone with SDL's renderer because, like everything else in SDL, it just does the right thing on every platform. You're on Windows and need Direct3D? No problem. Linux and need OpenGL? Cool. macOS and Metal? Done. A friggin' PlayStation Portable? Yeah, we got that.
Don't have it? No worries, we even have a software renderer. :)
No matter what platform you're on, SDL's renderer just does the right thing. It just happens to do it veeeeeeery inefficiently.
Desktop computers, with their big big processors and beefy GPUs, are pretty forgiving about inefficient rendering. But if you want a truly pathologically unforgiving case, try Safari on iOS. This webpage, built from SDL 2.0.8's testsprite2.c demo, does about 110 draw calls per frame. Load it on an iPhone or iPad; I'll wait. 110 draw calls.
So just that blue box on the portfolio is 132 draw calls, before you render a movie, or any graphics, or lord knows, any of the other text. My phone can run Infinity Blade 3 without a hitch, but it can't draw 100 smiley faces? Obviously something needed to change.
The first clue was written right at the top of the source code for our GLES2 renderer:
    /* !!! FIXME: Emscripten makes these into WebGL calls, and WebGL doesn't
       offer client-side arrays (without an Emscripten compatibility hack,
       at least), but the current VBO code here is dramatically slower on
       actual iOS devices, even though the iOS Simulator is okay. Some time
       after 2.0.4 ships, we should revisit this, fix the performance
       bottleneck, and make everything use VBOs. */
(Fortunately, we are now living "some time after 2.0.4 ships," heh.)
For those who aren't OpenGL-savvy: it should be a huge red flag that Vertex Buffer Objects were slower than client-side arrays. That generally suggests a problem in the app's code, not in the GL. It's also telling that this was only a problem on iOS and not its Simulator (which runs on a desktop CPU and GPU). Of course Safari on iOS was having problems; the same code, run natively, was too.
It didn't take long to figure out why: every single smiley face was replacing a few bytes in a single vertex buffer. It's likely that the GL was stalling while it transferred those bytes to the GPU in between each draw call, waiting until that vertex buffer was available for reuse again. Stall, stall, stall, about 110 times per frame, and it adds up quickly.
Back in ancient times, I worked to port Haaf's Game Engine. I remember when I first saw it, I noticed it did something interesting: it batched up all its draw calls as long as possible. Like this. In English: the app said "draw a rectangle like so, I don't care how you do it" and the engine says "this rectangle is like other rectangles you want me to draw, so I'll just add it to an array for now." (or: "this rectangle is not like the others, so send those others to the hardware and empty the array out, so we can start a new array with this different rectangle.")
Haaf's array wasn't just any old array, though, it was a locked Direct3D vertex buffer. By holding it open as long as possible, he would batch the drawing commands, storing them up and then doing several at once, even though his engine's API makes it look like each draw happened immediately.
I thought that was clever, and I wondered if the same idea would hold up if we batched not just a handful of calls that happened to share a texture or whatever, but every single draw in a frame.
So I ripped up the GLES2 renderer to do just this. Everything that looks like a rendering operation (not just draws, but setting the viewport, the clip rectangle, etc.) goes into a linked list, which becomes our "command queue." The app thinks it's specifying things to draw and moving on, but behind the scenes the implementation is just appending another command to the queue and returning immediately.
When forced to run the queue (the frame is done, the render target is changing, or we want to change the contents of a texture that's still waiting to be used in this queue), the implementation doesn't update a vertex buffer a bunch of times; it builds one massive, hulking vertex buffer and does all the draws out of it. You still have a single big upload to the GPU, but the bottleneck was never bandwidth; it's the fixed cost of doing a transfer at all, so doing it all at once is a huge win.
This was clearly a concept that works.
I ripped all this code out of the GLES2 renderer and moved it into the higher level. Now all the backends use this technique. Even the immediate mode OpenGL renderer! It turns out when you do it all in a batch, even if you draw in very slow ways, you can still avoid a ton of state changes, because you know how you left things earlier in the same function. :) I started aggressively caching draw state and not resetting anything I could avoid.
The way this works now is that SDL's renderers implement methods for queuing new commands: SDL says "queue up a new textured quad," and the backend says "give me 128 bytes in the vertex buffer" and SDL makes sure memory is available and hands it off. The rendering backend fills in whatever data it needs and SDL adds a command to the queue with those buffer offsets and life goes on. When running the queue, the backend moves that buffer to the GPU and draws with the appropriate offsets.
SDL can deal with alignment issues (Metal even puts things like new viewport coordinates in the vertex buffer, but they need to be aligned to 256 bytes...blame Nvidia!); SDL tracks the gaps it leaves when aligning data and tries to fill them with future draws, so the final vertex buffer might not be in draw order, but it will be tightly packed.
So, even on those desktop PCs with the forgiving GPUs, performance went up, across all backends, by 2x to 5x. 20,000 smileys on Direct3D 11 went from 70fps to 240. OpenGL went from 40fps to 120. Metal went from 80 to 190. And so on.
Software rendering hasn't changed much, but there's so much low-hanging fruit in there we could exploit, now that we have a command queue; some day, I think it would be interesting to experiment with a span buffer.
So this is all good news, but it brings one concern: there are games that use SDL's renderer but also want to make their own additional OpenGL (or whatever) calls, possibly to supplement what SDL itself offers. We have a plan for this!
First: if you create a renderer but don't ask for a specific one (either you called SDL_CreateRenderer() with a -1 index, meaning "don't care," or you called SDL_CreateWindowAndRenderer() which doesn't give you an option), we will batch as much rendering as we can. You just do what you've always done and it works faster.
If you asked for a specific renderer, we turn off batching by default, because we can't be sure you aren't going to operate on the results of SDL's rendering, or even just doodle on top of it. Turning off batching just means we implicitly flush the queue after each draw call, so everything works more or less like it did in 2.0.8. But don't worry: even in this case, we are still markedly faster than 2.0.8's renderer, because a lot of internal inefficiency got cleaned up!
Now, maybe you asked for a specific renderer because your engine has a cvar that a user set, or you just find OpenGL is more stable than Direct3D for your users or something like that, but you otherwise don't touch those APIs directly. For this case, you want to set an SDL hint before creating the renderer...
...and we'll trust you know what you're doing. It's not a bad idea to set this in any case, as a promise not to pull off the warranty-voiding sticker, in case your users force the renderer themselves with an environment variable.
Now, if you're really careful, you can use batching and the lower-level rendering API at the same time. Set that hint and call the new SDL_RenderFlush(SDL_Renderer *) between any rendering SDL does and any rendering you want to do. That way you get the fastest rendering path through SDL, but can guarantee everything makes it to the GPU in the right order. But honestly? Most people can just keep using SDL the way they always have; this will still be backwards compatible and faster than what you currently have, no matter what your needs are.
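For the really-careful path, a frame might look like the sketch below. One assumption to flag: the post doesn't name the hint, so I'm using SDL_HINT_RENDER_BATCHING, the name this shipped under in SDL's public headers (SDL 2.0.10), alongside SDL_RenderFlush().

```c
/* Sketch of mixing batched SDL rendering with your own GL calls, assuming
   the API as it shipped: SDL_HINT_RENDER_BATCHING and SDL_RenderFlush(). */
#include "SDL.h"

void render_frame(SDL_Renderer *renderer, SDL_Texture *sprite)
{
    SDL_RenderClear(renderer);
    SDL_RenderCopy(renderer, sprite, NULL, NULL);  /* queued, not yet drawn */

    /* Make sure everything SDL queued actually reaches the GPU... */
    SDL_RenderFlush(renderer);

    /* ...so your own OpenGL (or whatever) calls land on top, in order. */
    /* glDrawArrays(...); */

    SDL_RenderPresent(renderer);
}

int main(int argc, char **argv)
{
    (void) argc; (void) argv;
    SDL_Init(SDL_INIT_VIDEO);

    /* The promise: "I'll flush before touching the GL myself," so batching
       can stay on even if a renderer gets forced by hint or index. */
    SDL_SetHint(SDL_HINT_RENDER_BATCHING, "1");

    SDL_Window *window = SDL_CreateWindow("demo", 100, 100, 640, 480, 0);
    SDL_Renderer *renderer = SDL_CreateRenderer(window, -1, 0);

    /* ...load `sprite`, loop over render_frame()... */

    SDL_DestroyRenderer(renderer);
    SDL_DestroyWindow(window);
    SDL_Quit();
    return 0;
}
```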
And that's all! This was a lot of work, and I'm thrilled with the end result. This project is sitting in the "SDL-ryan-batching-renderer" branch in revision control; we're about to ship SDL 2.0.9 and it's way too late to drop a change this huge into it, but it will merge thereafter. I would appreciate it if people would test their games against it, as it's all over but the debugging at this point. :)
Now that this project is done, it's back to porting games, so I can have more titles that I can some day put on that goofy little portfolio app. :)