I thought I’d explain what all this is about in more detail, since I mentioned it in the December wrapup. This is a long and technical post about weird low-level stuff and you can totally skip this if you don't care about weird low-level stuff.
I've been building a project called “2ine,” pronounced “twine,” which emulates OS/2 binaries in the same way that Wine emulates Windows binaries. I named it as such to continue the fine tradition of programmers producing good code and bad product names.
Any introduction to computers starts with a whole lot of lies. It’s complexity all the way down to the bottom, and you don’t need to know it all when you just want to write a program that prints out “hello world.” This is a blessing and a curse, but mostly a survival mechanism; if you knew everything you were about to stumble into, you’d never start. In this way, I stumbled into writing an OS/2 emulator.
In 2003, when I was working on what would become Unreal Tournament 2004, I was trying to get the software renderer working on Linux. The market was rotten with fast CPUs and lousy GPUs, and OpenGL support on Linux was spotty back then in any case, so having a software fallback was super useful. For a software renderer, UT2004 uses RAD Game Tools’ Pixomatic, which is one of those amazing pieces of code that seems to defy physics.
Pixomatic is itself a technical marvel of low-level x86 sorcery; Michael Abrash wrote about it in detail here and here and here. But of all the things it can do, all you need to know for now is that the “Linux port” of Pixomatic consists of a Windows .dll and a small piece of C code to load it.
I’m not kidding.
All Pixomatic cares about is lighting up pixels in a memory buffer and only needs the win32 API to manage pages of memory. So this C code would get the Windows DLL loaded, provide a few function pointers that would call mmap() when Pixomatic thought it was calling VirtualAlloc() and such, and then everything just sort of works on Linux. At the end of our trip to Win32 Land, UT2004 would use SDL to flip that memory buffer to the screen as if nothing unusual happened here at all. It’s wild.
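To make the function-pointer trick concrete, here is a minimal sketch of what such a shim could look like. The MemoryHooks struct and the shim_* names are hypothetical, invented for illustration; this is not Pixomatic's actual interface, just the general shape of handing a loaded DLL some callbacks that land in mmap().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical callback table handed to the loaded DLL in place of the
   win32 memory functions. Names and layout are illustrative only. */
typedef struct {
    void *(*alloc_pages)(size_t size);
    int (*free_pages)(void *ptr, size_t size);
} MemoryHooks;

/* Stand-in for VirtualAlloc(): anonymous, private, zero-filled pages. */
static void *shim_alloc_pages(size_t size)
{
    void *p;
    assert(size > 0);
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Stand-in for VirtualFree(): returns nonzero on success. */
static int shim_free_pages(void *ptr, size_t size)
{
    return munmap(ptr, size) == 0;
}

static const MemoryHooks linux_hooks = { shim_alloc_pages, shim_free_pages };
```

The DLL never learns that the pages came from Linux; it just calls through the table.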
What surprised me about this wasn’t the approach—although that was unexpected too—but the simplicity of the C code that loads the DLL. 618 lines of code! Is this all a DLL needs?
The answer, as always, is yes and no.
Shared libraries on most operating systems work like this: you get a table of information of where to place data in memory and the system loader blasts it out there. There’s more to the effort than that, but the juiciest tasks are:
And that’s it! In most cases, programs you run use the same format as shared libraries with minor differences. (Linux users! Did you ever try to run libc.so.6 like it was a program instead of a shared library? I’ll wait while you try it.)
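As a rough illustration of the "place the data, then fix it up" idea, here is a toy loader in C. The ToyImage layout is hypothetical and far simpler than any real executable format: it assumes one blob of bytes plus a list of offsets where 32-bit absolute addresses live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified description of a module on disk. */
typedef struct {
    const uint8_t *data;       /* raw bytes of the image */
    size_t size;               /* number of bytes in data */
    uint32_t preferred_base;   /* address the linker assumed */
    const uint32_t *fixups;    /* offsets of 32-bit absolute addresses */
    size_t num_fixups;
} ToyImage;

/* Copy the image to dest, then rebase every absolute address so it is
   valid when the image runs at load_addr instead of preferred_base. */
static void toy_load(const ToyImage *img, uint8_t *dest, uint32_t load_addr)
{
    uint32_t delta = load_addr - img->preferred_base;
    size_t i;

    memcpy(dest, img->data, img->size);
    for (i = 0; i < img->num_fixups; i++) {
        uint32_t value;
        assert(img->fixups[i] + sizeof value <= img->size);
        memcpy(&value, dest + img->fixups[i], sizeof value);  /* may be unaligned */
        value += delta;
        memcpy(dest + img->fixups[i], &value, sizeof value);
    }
}
```

A real loader also has to set page permissions and resolve imports from other modules, but the core loop is not much scarier than this.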
Pixomatic’s freakishly simple DLL loader got stuck in my head. Once you do the loading and fixing up, this isn’t really a win32 program at all any more. It’s code in a Linux process. Could we load other things like this? You bet.
Ten years later, I stumbled into writing an ELF loader for Linux, because dlopen() needs a filename and I wanted to load shared libraries from a memory buffer. ELF is more complicated than the Windows format, and this code is more robust in general, getting us to about 1200 lines of code, in a project I called MojoELF.
Once I had built that, I thought: if I load an ELF binary on a Mac, it’s no longer a Linux program. It’s code in a Mac process. And since the Mac has all that POSIX goodness and a quality SDL port, if we fix up the POSIX calls to function addresses that bridge differences in data layout, and fix up calls into SDL to the real Mac SDL library, well…maybe we can play a Linux build of Quake 3.
None of this is novel at all: this is roughly how Wine has always done things. They just had to work harder to deal with system calls into an OS that’s completely alien on Linux.
But in any case, this is an idea that works.
I don’t know what prompted me to do an OS/2 loader in the first place, but it was probably a straightforward case of nerd-sniping. I keep a list of interesting-waste-of-time projects and sometimes when my mind wanders, I foolishly look at this list and my productivity drops to zero until I can build some ridiculous thing that caught my eye.
Documents that explain the OS/2 file format (the binaries are called Linear Executable format, LX for short) are easy to find on the Internet, so why not try? It should send up alarms when I tell myself “Why not spend an hour and see how far you get?” There should be a Degrassi High School episode about this exact scenario, to serve as a warning to future programmers during their formative years!
Like that Pixomatic C code, loading an LX binary into memory is mostly easy. But even if you discount the ugly corner cases, you still have to implement the OS/2 API for the loader to be useful.
The first version of my OS/2 loader ran exactly one program, a Hello World thing, written in assembly because loading a C runtime was too complex at this point.
(Strictly speaking, the true first version was probably a program that set EAX to 42 and returned, just to see if the process exit code was also 42 when 2ine terminated, demonstrating we could bounce into OS/2 Land and back safely.)
OS/2, like Windows, doesn’t offer a single C runtime like Linux and macOS do, and all of them do some complicated tap dancing with system calls before main() even runs, and I didn’t want to mess with that yet. Eventually, I got to filling in some basic Unix-like bits, with the naming OS/2 uses: DosOpen() to open files, DosWrite() to write to a file handle, etc. The nice thing about DosWrite() is that file handles 0, 1, and 2 match up with Unix stdin, stdout and stderr; this helped get a bunch of OS/2 command line programs running without added drama, and you can even pipe them through to other Linux processes.
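A minimal sketch of what a native DosWrite() could look like, assuming OS/2 file handles are used directly as Unix file descriptors (which only holds for the simple cases) and with a deliberately crude error mapping. The types and error codes are declared locally for the example rather than pulled from a real OS/2 header.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* OS/2-style types and a few error codes, declared locally for the sketch. */
typedef uint32_t APIRET, HFILE, ULONG;
enum { NO_ERROR = 0, ERROR_ACCESS_DENIED = 5, ERROR_INVALID_HANDLE = 6 };

/* Write len bytes to handle h; report how many were written in *actual. */
APIRET DosWrite(HFILE h, const void *buf, ULONG len, ULONG *actual)
{
    ssize_t rc;

    assert(actual != NULL);
    rc = write((int)h, buf, len);
    if (rc < 0) {
        *actual = 0;
        /* Assumption: anything that isn't a bad handle is reported as
           access denied. A fuller version would map more errno values. */
        return (errno == EBADF) ? ERROR_INVALID_HANDLE : ERROR_ACCESS_DENIED;
    }
    *actual = (ULONG)rc;
    return NO_ERROR;
}
```

Because handles 0, 1, and 2 pass straight through, output from an OS/2 program ends up on the Linux process's real stdout and stderr, which is why piping works.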
System APIs are written in C, as native Linux code. When the OS/2 module reports that it needs the (system-provided) DOSCALLS.DLL, 2ine dlopens its own libdoscalls.so with these reimplemented APIs, and uses the native Linux entry points to fix up the OS/2 module. Now when the app calls DosWrite, the CPU calls directly into a Linux ELF shared library where this function was reimplemented, not knowing the difference. The 32-bit calling conventions happen to match up well enough between the two platforms that it just happens to work.
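The fixup step can be sketched as a lookup from (module, ordinal) to a native function address. Everything here is hypothetical: a static table stands in for what dlopen() and dlsym() would provide, the ordinals are made-up placeholders, and the stubs do nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*NativeFn)(void);

/* One native reimplementation of an OS/2 export. */
typedef struct {
    const char *module;   /* e.g. "DOSCALLS" */
    uint32_t ordinal;     /* placeholder values, not real OS/2 ordinals */
    NativeFn native;      /* entry point in the native library */
} NativeExport;

/* Empty stand-ins for the real reimplemented APIs. */
static void stub_DosOpen(void) {}
static void stub_DosWrite(void) {}

static const NativeExport doscalls_exports[] = {
    { "DOSCALLS", 1, stub_DosOpen },
    { "DOSCALLS", 2, stub_DosWrite },
};

/* Find the native entry point for an import, or NULL if we lack one. */
static NativeFn resolve_import(const NativeExport *table, size_t count,
                               const char *module, uint32_t ordinal)
{
    size_t i;
    assert(table != NULL && module != NULL);
    for (i = 0; i < count; i++) {
        if (table[i].ordinal == ordinal && strcmp(table[i].module, module) == 0)
            return table[i].native;
    }
    return NULL;
}

/* Patch one import slot in the loaded module; returns 0 if unresolved. */
static int fixup_import(NativeFn *slot, const NativeExport *table, size_t count,
                        const char *module, uint32_t ordinal)
{
    NativeFn fn = resolve_import(table, count, module, ordinal);
    if (fn == NULL)
        return 0;
    *slot = fn;
    return 1;
}
```

Once the slot is patched, a call through it lands in native code with no further translation.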
An OS/2 app can run under 2ine using a mix of native Linux libraries that reimplement system APIs and real OS/2 DLLs, so long as those DLLs don't do weird things or depend on DLLs that do weird things (what qualifies as a "weird thing" could fill a whole other blog post, though). Right now, some things that would otherwise fail to operate will run on 2ine if they have access to a handful of IBM's system DLLs, with native libraries spackling in the cracks.
With some effort, and with enough APIs filled in, the OS/2 port of GCC that I used in high school (2.8.1! It doesn’t even support Pentium 1 instructions!) started running, since it doesn’t need much more than stdio and a way to launch child processes, allowing me to build OS/2 programs on Linux, using the compiler under my emulator. Now we’re getting somewhere!
And then I thought, hey, let’s get Watcom C working too, and here my troubles began. Specifically, my troubles began with the compiler’s help screen saying “press return to continue.”
(That’s right, you can debug OS/2 binaries on Linux with GDB by running them under 2ine, as long as you don’t expect debug symbols or source code views! Also, as long as you can convert 16:16 pointers to a linear address in your head!)
OS/2 2.0 was a 32-bit operating system. 1.0 was not. Many of the APIs from 1.0 survived into the 32-bit transition, but they never got converted to 32-bit APIs themselves, even in the final 4.5 releases, years later. I suppose this was because IBM wanted developers to write Presentation Manager (the new GUI/window functions) programs instead of VIO (text mode/command line) programs. APIs for things like file management and thread primitives continued to exist as 16-bit APIs and also gained new 32-bit entry points for the same functions, but things that dealt with text-based programs (Vio* for writing to a console, Kbd* for keyboard input, etc) never got 32-bit equivalents.
(The mythical PowerPC port of OS/2 fixed this, making these APIs 32-bit clean, apparently, but these features never returned to the Intel port.)
Now one could definitely write a 32-bit command line program on OS/2, but if one needed to call these older system functions, one had to call into 16-bit code. This is done through the magic of thunking and some wizardry with memory segments. Imagine my surprise when Watcom C would print out a page of command line information, then wait for a keypress with KbdCharIn(). To do this, it would jump into a 16-bit code segment, which would save off some registers and call the never-updated-for-32-bit API call, restore some registers afterwards and jump back to 32-bit land with the results.
First problem: I don’t have a 16-bit code segment! Second: I don’t have a way to generate 16-bit code with GCC.
After some googling, I found there’s a Linux-specific system call to help with this, which Wine and dosemu use to support Win16 and MS-DOS. It’s called modify_ldt(), and it lets you map pages from your 32-bit linear address space to a 16-bit selector. LDTs are a feature of the x86 processor; you can read up on them on Wikipedia. Operating systems rely heavily on them, and userspace code doesn't unless it's doing wacky things like emulating ancient OSes.
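Here is a minimal sketch of describing a 16-bit segment for modify_ldt(), assuming an x86 Linux system with the kernel's asm/ldt.h header available. The helper names are hypothetical; glibc provides no wrapper for this system call, so it goes through syscall().

```c
#define _GNU_SOURCE
#include <asm/ldt.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Describe a 64 KB, 16-bit segment starting at linear address base. */
static struct user_desc make_16bit_desc(unsigned index, uint32_t base, int is_code)
{
    struct user_desc d;

    assert(index < LDT_ENTRIES);
    memset(&d, 0, sizeof d);
    d.entry_number = index;
    d.base_addr = base;
    d.limit = 0xFFFF;        /* byte granularity: 64 KB */
    d.limit_in_pages = 0;
    d.seg_32bit = 0;         /* this is what makes it a 16-bit segment */
    d.contents = is_code ? MODIFY_LDT_CONTENTS_CODE : MODIFY_LDT_CONTENTS_DATA;
    d.read_exec_only = 0;
    d.seg_not_present = 0;
    d.useable = 1;
    return d;
}

/* Ask the kernel to install the entry (func 1 means "write").
   Returns 0 on success, -1 with errno set on failure. */
static int install_ldt_entry(const struct user_desc *d)
{
    return (int)syscall(SYS_modify_ldt, 1, d, sizeof *d);
}

/* Selector for an LDT index: table-indicator bit set, privilege level 3. */
static uint16_t ldt_selector(unsigned index)
{
    return (uint16_t)((index << 3) | 7);
}
```

Some kernels are built without this system call or block it in sandboxes, so the return value of install_ldt_entry() is worth checking.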
Okay, now I can create 16-bit segments, so what segments do I create?
If you’re OS/2 2.0, the answer is: all of them. OS/2 would “tile” the entire address space, so any 32-bit pointer you might have automatically exists in some 16-bit segment, and you could do some simple math on the pointer itself to determine it (shift the top 16 bits left by 3 and bitwise OR with 7 to get the selector, bottom 16 bits are your offset).
The problem with this approach is that you only have 8192 possible selectors (you only get to use 13 of the bits!), times 64 kilobytes in each segment, which means 32-bit OS/2 apps can only access the first 512 megabytes of their address space in this system. In IBM’s defense, if your machine had more than 4 megabytes of total physical RAM at the time, that was a powerhouse computer.
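The pointer math described above, and the 512 megabyte ceiling that falls out of it, can be written down directly. The type and function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* 8192 selectors * 64 KB each = 512 MB of tiled address space. */
#define TILED_LIMIT 0x20000000u

typedef struct {
    uint16_t selector;
    uint16_t offset;
} FarPtr16;

/* Flat 32-bit address -> 16:16 pointer, assuming a fully tiled LDT. */
static FarPtr16 flat_to_tiled(uint32_t addr)
{
    FarPtr16 p;
    assert(addr < TILED_LIMIT);
    p.selector = (uint16_t)(((addr >> 16) << 3) | 7);  /* top 16 bits, shifted, OR 7 */
    p.offset = (uint16_t)(addr & 0xFFFF);              /* bottom 16 bits */
    return p;
}

/* 16:16 pointer -> flat 32-bit address, under the same assumption. */
static uint32_t tiled_to_flat(FarPtr16 p)
{
    return ((uint32_t)(p.selector >> 3) << 16) | p.offset;
}
```

For example, flat address 0x00123456 becomes selector 0x0097, offset 0x3456, and the last tileable byte, 0x1FFFFFFF, maps to FFFF:FFFF.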
Later versions of OS/2 stopped tiling like this, and offered an API to do the conversion for you (“DosFlatToSel”), but lots of programs rely on tiling and do the pointer math themselves without using the API. In hopes that this matches what OS/2 ended up doing, I tile the main thread’s stack (since this is probably where most data you want to reach in 16-bit code lives; temporary local variables for an API call) and any memory segments in the LX module that were marked as 16-bit. Non-tiled LDTs are then allocated when a DosFlatToSel() call is made, using cached selectors from previous allocations and tiles when possible. So far, it’s working out okay and we aren’t limited to 512 megabytes of memory, under the assumption most 16-bit calls happen at a handful of locations and the things that assume the pointer math works only try it in unsurprising ways. Knock on wood.
Now I probably have the address space politics worked out (minus Thread Local Storage, which does something bonkers I'll explain some other time), but I still can’t generate 16-bit code with GCC. The solution is not to. Just because the OS/2 app wants to call a function in a 16-bit code segment doesn’t mean we need the function to be 16-bit code. All we need is a little bit of bridge code for the OS/2 app to land in that moves us into our native implementation. As an added benefit, it means these APIs are usable to OS/2 apps recompiled from source as native Linux apps with no 16-bitness at all; think Wine vs Winelib.
So you get some macro salsa to define functions:
As you can see, this writes the bytes of the 16-bit x86 instructions directly to a memory buffer, instead of trying to get GCC to assemble them. The macro is used once for each “16-bit” API we export. The code got assembled with Netwide Assembler, then disassembled with ndisasm and pushed through a perl script to produce this code.
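To give a flavor of writing instruction bytes into a buffer by hand, here is a small emitter for a single instruction: a far jump from a 16-bit code segment to a 16:32 address, encoded as 66 EA followed by the 32-bit offset and the 16-bit selector. The CodeBuf type and emit_* helpers are hypothetical, and this is not the actual bridge sequence, just one representative instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Growing cursor over a fixed-size buffer of machine code. */
typedef struct {
    uint8_t *buf;
    size_t len;
    size_t cap;
} CodeBuf;

static void emit8(CodeBuf *c, uint8_t b)
{
    assert(c->len < c->cap);
    c->buf[c->len++] = b;
}

/* x86 is little-endian: low byte first. */
static void emit16(CodeBuf *c, uint16_t v)
{
    emit8(c, (uint8_t)(v & 0xFF));
    emit8(c, (uint8_t)(v >> 8));
}

static void emit32(CodeBuf *c, uint32_t v)
{
    emit16(c, (uint16_t)(v & 0xFFFF));
    emit16(c, (uint16_t)(v >> 16));
}

/* jmp far selector:offset32, as seen from a 16-bit code segment. */
static void emit_far_jmp32(CodeBuf *c, uint16_t selector, uint32_t offset)
{
    emit8(c, 0x66);          /* operand-size prefix: use a 32-bit offset */
    emit8(c, 0xEA);          /* jmp far, immediate pointer */
    emit32(c, offset);
    emit16(c, selector);
}
```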
And then, like doing skateboard tricks, there’s nothing left to do but say “watch this,” and see if you pull off something awesome or just crash.
This is the sort of inefficiency and trouble that drives engineers mad, but here’s how we kept the CPU happy through this process:
One more piece of magic for 16-bit support: when writing x86 Linux code, you probably don’t think about your 32-bit linear address space as having a “code segment,” but it does. It’s not guaranteed to be any specific value, but is currently hardcoded (0x23 if you’re running a 32-bit app on an amd64 kernel, 0x73 if you’re on a real 32-bit kernel). You have one because code segments are how x86 processors keep track of code privilege level; the kernel runs in a different segment with higher privileges, which lets it have instructions that your userland code can’t use.
OS/2 has a hardcoded code segment for 32-bit code, too. It’s 0x5B. I spent a week trying to get IBM’s command line FTP.EXE client to not crash because it wants to read passwords from the keyboard without echoing them to the screen, and that needs a 16-bit API, even though all the rest of the input is just a 32-bit DosRead() on stdin. After much head scratching about why it was trying to jump back from 16-bit land to a totally bogus 32-bit code segment, I found an ancient IBM CourseWare document on the Internet Archive that explained this. Since the OS/2 kernel hardcoded code segment 0x5B, IBM’s CSet/2 compiler hardcoded it into a bunch of apps, too, to get back to 32-bit land. EMX (which was basically an OS/2 port of GCC) was smart enough to save off the CS register and not do this, avoiding this problem.
2ine can’t map code segment 0x5B; it’s a GDT entry, not an LDT entry, and the GDT isn’t something you can really mess with from userland. The best we can do is sniff through the 16-bit code segments we load from an OS/2 binary for far jumps to that segment and fix them up. That code is dirty-nasty-gross, though.
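A naive version of that sniffing might look like the sketch below. It assumes the jumps are encoded as 66 EA with a 32-bit offset and 16-bit selector, and it pattern-matches raw bytes without decoding instructions, so it can be fooled by data that happens to look like a jump. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a 16-bit code segment for "66 EA off32 sel16" far jumps whose
   selector is from_sel, and rewrite the selector to to_sel.
   Returns the number of jumps patched. */
static size_t patch_far_jumps(uint8_t *code, size_t len,
                              uint16_t from_sel, uint16_t to_sel)
{
    size_t patched = 0;
    size_t i;

    assert(code != NULL || len == 0);
    for (i = 0; i + 8 <= len; i++) {
        uint16_t sel;
        if (code[i] != 0x66 || code[i + 1] != 0xEA)
            continue;
        sel = (uint16_t)(code[i + 6] | (code[i + 7] << 8));
        if (sel != from_sel)
            continue;
        code[i + 6] = (uint8_t)(to_sel & 0xFF);
        code[i + 7] = (uint8_t)(to_sel >> 8);
        patched++;
        i += 7;   /* skip the rest of this instruction */
    }
    return patched;
}
```

Called with from_sel set to 0x5B and to_sel set to whatever code segment the Linux process really has, the jump back to 32-bit land goes somewhere that exists.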
With that fix in place, and enough implementation of TCPIP32.dll, we were really flying now.
There were other fixes to be made, and features to implement, but now that most of the 16-bit drama was handled, and most of the “fun” problems with command line apps were done, it was time to move on to Presentation Manager apps, which generally don’t make 16-bit calls at all. The problem here isn’t binary compatibility but that Presentation Manager is a massive API surface that I’d have to write from scratch.
My pep-talk sounds a lot like this, echoing in my dark office at 2am: “this had to run on 386 machines with 2 megabytes of RAM and was built with caveman-primitive development tools. It couldn’t be that complex.” Or, more succinctly: “Simplicity encourages speed.” Often, when trying to work out how some system was implemented in 1992, I imagine the cleanest, easiest way to write it, and pray that’s actually how it went down at Big Blue, too. This is straight-up cockeyed optimism on my part.
I tried to write the simplest PM program possible. I can’t even call it a “hello world” app because rendering text is an extra layer of complexity. Instead, I went for something that creates a 100x100 pixel window, at screen coordinate (100, 100). It paints it white when it needs painting, and if you click on it, it quits the program. Here’s the code:
The function names are different, but this should look familiar to you if you’ve ever done any Windows programming at the win32 (or win16) API level. It quickly becomes apparent that Windows and OS/2 started with the same API and drifted apart as their parents slid into divorce.
That program looks like this, running on OS/2:
(“Netscape Communicator” is not the most obsolete web browser installed on this system, believe it or not.)
See that white square? That’s our app! If you’re wondering why it’s so low on the desktop, OS/2’s coordinate system puts (0, 0) at the lower left corner, so that’s 100 pixels from the bottom, not the top.
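Mapping that bottom-left origin onto a toolkit that counts from the top is a one-line flip. A minimal sketch, with a made-up function name:

```c
#include <assert.h>

/* Convert an OS/2-style y coordinate (origin at the bottom of the parent,
   y names the window's bottom edge) to a top-down y coordinate naming
   the window's top edge. */
static int os2_y_to_top_down(int parent_height, int y, int height)
{
    assert(parent_height >= 0 && height >= 0);
    return parent_height - (y + height);
}
```

On a hypothetical 768-pixel-tall desktop, the 100-pixel-tall window at y = 100 has its top edge 568 pixels from the top of the screen.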
To get this working, I just need to implement 14 functions! Unfortunately, some of these functions are hella complex. WinCreateWindow(), for example, is basically the core of the entire paradigm, so there are tons of arguments that do different things, flags that alter behavior, etc. WinGetMsg() needs to produce hundreds of possible window system events, and WinDefWindowProc() needs to recognize all of them.
Don’t panic: aim for simplicity first. We don’t need all those events right now, and even if we did, WinDefWindowProc() responds to almost all of them by just returning zero.
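In that spirit, a first-pass default handler can be almost nothing. This sketch declares the OS/2 types and a couple of message numbers locally (check the toolkit headers for the authoritative values); the sketch_ and sample_ names are hypothetical, and the sample window procedure only mimics the click-to-quit behavior of the test app.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OS/2-style types, declared locally for the sketch. */
typedef uintptr_t HWND;
typedef uint32_t ULONG;
typedef void *MPARAM;
typedef void *MRESULT;

/* Message numbers as I recall them from the PM headers; treat as illustrative. */
enum { WM_PAINT = 0x0023, WM_BUTTON1DOWN = 0x0071 };

/* Default handler: answer everything with zero. The few messages that
   need a real default response would get cases here as they come up. */
static MRESULT sketch_WinDefWindowProc(HWND hwnd, ULONG msg, MPARAM mp1, MPARAM mp2)
{
    (void)hwnd; (void)msg; (void)mp1; (void)mp2;
    return (MRESULT)0;
}

static int g_quit_requested = 0;

/* An app's window procedure handles what it cares about and hands
   everything else to the default handler. */
static MRESULT sample_window_proc(HWND hwnd, ULONG msg, MPARAM mp1, MPARAM mp2)
{
    switch (msg) {
    case WM_BUTTON1DOWN:
        g_quit_requested = 1;   /* the real app would post a quit message */
        return (MRESULT)0;
    default:
        return sketch_WinDefWindowProc(hwnd, msg, mp1, mp2);
    }
}
```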
I’ve been collecting up books on OS/2 programming from the web (the Internet Archive has PDFs of so many programming books that are otherwise cluttering landfills now) and physical copies from Amazon where I can. I’m looking for quirks and tiny implementation details of these APIs. But, unexpectedly, the best resource turned out to be IBM’s SDK documentation.
Most things are covered, not just in basic function call info, but subtle interactions, window messages it generates, etc. I’m not sure why I assumed this would be lacking, but it seems to be the best route towards reimplementation.
So let’s reimplement! I knew immediately that I didn’t want to talk to X11 directly, because X11 sucks in general to work with and with luck it’s a dying system anyhow. Since I am (ahem) familiar with SDL, and Epic Games and I had spent so much time working with SDL as the backend of a GUI toolkit, I figured I’d start there. In 2ine, top-level windows (things with the desktop as their parent) generate an SDL window. Using the terminology of Java’s Swing framework, this is a “heavyweight” window. Child windows slice that heavyweight window into chunks, and since they don’t make an operating system window of their own, but just maintain some logical state about themselves, they’re “lightweight” windows.
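The heavyweight/lightweight split boils down to a tree where only the root owns a native window. Here is a minimal sketch with a hypothetical Window struct; the native field is an opaque pointer standing in for something like an SDL_Window, and the y-axis flip is ignored to keep it short.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Window {
    struct Window *parent;   /* NULL: the desktop is the parent */
    int x, y;                /* position relative to the parent */
    void *native;            /* backing OS window for heavyweights, else NULL */
} Window;

/* Walk up to the top-level window that actually owns pixels. */
static const Window *heavyweight_of(const Window *w)
{
    assert(w != NULL);
    while (w->parent != NULL)
        w = w->parent;
    return w;
}

/* Position of a window's origin inside its heavyweight ancestor. */
static void to_heavyweight_coords(const Window *w, int *x, int *y)
{
    assert(w != NULL && x != NULL && y != NULL);
    *x = 0;
    *y = 0;
    while (w->parent != NULL) {
        *x += w->x;
        *y += w->y;
        w = w->parent;
    }
}
```

A lightweight child draws by translating into its heavyweight ancestor's space and clipping to its own rectangle; it never asks the OS for a window of its own.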
The heavyweight window also does something else interesting: it creates an SDL_Renderer and an SDL_Texture to draw to. This lets us render drawing primitives to the window with OpenGL and keep a backing store of any rendering (so no need to send paint messages just because a window got dragged out of the way). Other benefits: clipping is basically free when a child window is drawing, and we can scale up apps that thought 800x600 was an impossibly massive screen resolution.
So about 1600 lines of C later, I had enough of the Presentation Manager implemented to get our little white square popping up on a Linux desktop, produced by an OS/2 binary.
Going further on this is non-trivial, though. The effort involved is probably about equivalent to the man-hours the Wine project needed to put in before you could reasonably assume a Win16 program would function correctly. Trust me: there’s a lot of work to be done.
I do hope to do this work at some point, but I’m probably at the point where continuing on this is trying everyone’s patience, so I’m going to move back to something more practical next (and something impractical, like a video game). Still, it feels good to have set out to climb a mountain, as ridiculous as that mountain might be to climb, and stand on a hill some distance above the ground. I will set camp here for now and examine some other mountains for a while.