Dromaiidae - Emu

In the previous post I wrote that I'm taking a break from USB on RasPi for a while. I need some fresh air to stay focused on the work. If I just stuck with USB for longer, it would progress at a much slower pace. So, I'm taking a break, but staying with AROS, or with ARM.

I once wrote that in the future it will be possible to use m68k binaries with AROS on RasPi, and I really meant it. So, a few months ago I started to look around for available m68k emulators. And you know what? There are plenty of them. Hurray? No. They are either slow interpreters written eons ago as a proof of concept, or GPL licensed. Even though I find the GPL OK and use it myself from time to time, I find it too restrictive in this case. Besides, I wanted to try something new :)

So, here it goes, my new project on GitHub: Emu68. The project is at a very early stage and there is almost no code inside, but hey, it's only 4 days old.

What is Emu68? It is going to be a JIT emulator for 680x0 processors. It is not going to be a complete m68k emulation with all the bells and whistles (maybe it will get there later on), only the parts necessary to run the code.

Emu68 provides a large instruction cache, where blocks of up to 32 m68k instructions (this is a configurable parameter) are translated into blocks of ARM instructions. Each translation unit stored in the cache has two nodes: one links the translation into a hash table (for quick fetches), the other keeps track of how recently the unit was used. Here I am using the simplest LRU strategy: newly added units are put at the front of the LRU queue, units which were already in the cache and are fetched for execution are moved to the front too, and units which are going to be flushed are removed from the end of the queue. Please note that all these list operations are O(1).
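To make the two-node idea concrete, here is a minimal C sketch of such a cache. It is not the actual Emu68 code: the names (`struct unit`, `get_unit`, `cache_lookup`), the tiny capacity and the use of `malloc` are all made up for illustration. The hash function is only a guess that happens to match the single `Hash:` line in the debug output further down (0x8a090 gives 0xa098). Note that only the list insertions and removals are strictly O(1); walking a hash bucket is O(1) on average.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HASH_SIZE 65536u /* assumption: 16-bit hash, as the debug output suggests */
#define MAX_UNITS 3u     /* tiny on purpose, so eviction is easy to trigger */

/* Intrusive list node; lists are circular with a sentinel head. */
struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }

static void node_remove(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void add_head(struct node *h, struct node *n)
{
    n->prev = h;
    n->next = h->next;
    h->next->prev = n;
    h->next = n;
}

struct unit {              /* hypothetical translation unit header */
    struct node hash_node; /* first member: links the unit into its bucket */
    struct node lru_node;  /* links the unit into the LRU queue */
    uint32_t m68k_addr;    /* address of the first translated instruction */
};

static struct node buckets[HASH_SIZE];
static struct node lru;
static unsigned unit_count;

/* Assumed hash: fold the upper half of the address into the lower half. */
static uint32_t hash_addr(uint32_t addr) { return (addr >> 16) ^ (addr & 0xffffu); }

static void cache_init(void)
{
    for (size_t i = 0; i < HASH_SIZE; i++)
        list_init(&buckets[i]);
    list_init(&lru);
    unit_count = 0;
}

/* Hit: move the unit to the front of the LRU queue. Miss: return NULL. */
static struct unit *cache_lookup(uint32_t addr)
{
    struct node *b = &buckets[hash_addr(addr)];
    for (struct node *n = b->next; n != b; n = n->next) {
        struct unit *u = (struct unit *)n; /* valid: hash_node is first */
        if (u->m68k_addr == addr) {
            node_remove(&u->lru_node);
            add_head(&lru, &u->lru_node);
            return u;
        }
    }
    return NULL;
}

/* Look up a unit, or create it, flushing the least recently used one if full. */
static struct unit *get_unit(uint32_t addr)
{
    struct unit *u = cache_lookup(addr);
    if (u)
        return u;
    if (unit_count == MAX_UNITS) {
        struct node *last = lru.prev; /* end of the queue = least recently used */
        struct unit *victim =
            (struct unit *)((char *)last - offsetof(struct unit, lru_node));
        node_remove(&victim->lru_node);
        node_remove(&victim->hash_node);
        free(victim);
        unit_count--;
    }
    u = malloc(sizeof *u);
    if (!u)
        return NULL;
    u->m68k_addr = addr;
    add_head(&buckets[hash_addr(addr)], &u->hash_node);
    add_head(&lru, &u->lru_node);
    unit_count++;
    return u;
}
```

After `cache_init()`, every call to `get_unit(addr)` either returns the cached unit (and refreshes its LRU position) or creates a new one. Because each unit carries both nodes, flushing it needs no search: the LRU node finds the victim, and the hash node unlinks it from its bucket.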

Within each block the JIT translator has 10 free ARM registers for its own use, either for intermediate calculation results or as a cache for up to 8 m68k registers. Additionally, one ARM register is used as the m68k program counter and another one keeps the status flags. In order to reduce the number of ARM instructions, the status flags are updated only if really necessary. If a subsequent instruction is known to change some flags, these will not be updated by the current instruction. It saves a lot, since nearly every single m68k opcode affects the flags in some way.
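One possible way to decide which flags an instruction really has to update is a backward scan over the block. The sketch below is an assumption about how this could be done, not a description of the Emu68 translator; `struct insn_flags` and `flags_to_emit` are hypothetical names. It also treats every flag as needed at the end of the block, because SR is written back to the context there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* m68k condition code bits: C=1, V=2, Z=4, N=8, X=16. */
enum { F_C = 1, F_V = 2, F_Z = 4, F_N = 8, F_X = 16, F_ALL = 31 };

struct insn_flags {  /* hypothetical per-instruction summary */
    uint8_t sets;    /* flags this instruction overwrites */
    uint8_t uses;    /* flags this instruction reads */
};

/*
 * Walk the block backwards. A flag written by instruction i has to be
 * emitted only if something later reads it before it is overwritten again.
 */
static void flags_to_emit(const struct insn_flags *insn, size_t n, uint8_t *emit)
{
    assert(insn != NULL && emit != NULL);
    uint8_t live = F_ALL; /* assumption: all flags matter after the block */
    for (size_t i = n; i-- > 0; ) {
        emit[i] = insn[i].sets & live;
        live = (uint8_t)((live & ~insn[i].sets) | insn[i].uses);
    }
}
```

For a block of four MOVE-type instructions (which change N, Z, V and C, mask 15) followed by an ADDI (which changes all five flags, mask 31), this yields 0, 0, 0, 0, 31: only the last instruction has to emit any flag code.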

What do I have already? Not much, but the bits and pieces work as expected:

  • Calculation of the effective address (EA), combined with an optional fetch from or store to the EA (combining them can save one ARM instruction)
  • MOVE/MOVEQ instructions
  • ADDI/SUBI instructions
  • Instruction cache maintenance

What does it look like? Let's have a look at some very simple m68k code:

move.l #-559038737,d0
move.l d0,d1
move.w #-13570,d1
move.l d0,d7
addi.b #-16,d7

Nothing special, but I wanted to see if move.w works as expected (i.e. does not trash the upper 16 bits) and if addi.b sets the CPU flags properly. Letting my JIT execute it generates the following debug output:

[ICache] GetTranslationUnit(0x8a090)
[ICache] Hash: 0xa098
[ICache] Creating new translation unit at 0xf63ff060
[ICache] ARM code entry at 0xf63ff080
[ICache] Translated 5 M68k instructions to 32 ARM instructions
[ICache] Trimming translation unit length to 160 bytes

So far so good: 32 ARM instructions is not a lot, and 5 of them are always there, independent of the code size. So, let's look at the ARM assembly output:

ldr ip, [fp, #76] ; ip = m68k->PC
ldrh sl, [fp, #80] ; sl = m68k->SR
ldr r0, [ip, #2]   ; r0 = immediate -559038737.L
mov r1, r0            ; r1(D0) = r0
add ip, ip, #6      ; ip(PC) = ip(PC) + 6
mov r0, r1            ; r0(D1) = r1(D0)
add ip, ip, #2      ; ip(PC) = ip(PC) + 2
ldrh r2, [ip, #2]   ; r2 = immediate -13570.W
lsr r0, r0, #16   ; Clear lower 16 bits of r0(D1)
lsl r0, r0, #16
uxtah r0, r0, r2 ; r0(D1) = r0(D1) + r2(lower 16 bits only)
add ip, ip, #4      ; ip(PC) = ip(PC) + 4
mov r2, r1            ; r2(D7) = r1(D0)
add ip, ip, #2      ; ip(PC) = ip(PC) + 2
ldrb r3, [ip, #3]   ; r3 = immediate -16.B
lsl r3, r3, #24   ; r3 = r3 << 24
adds r3, r3, r2, lsl #24 ; r3 = r3 + r2(D7) << 24
lsr r3, r3, #24    ; r3 = r3 >> 24
bic r2, r2, #255  ; r2(D7) = r2(D7) & 0xffffff00
uxtab r2, r2, r3 ; r2(D7) = r2(D7) + r3(lower 8 bits only)
add ip, ip, #4      ; ip(PC) = ip(PC) + 4
bic sl, sl, #31     ; Clear X N Z V C of m68k context
orrmi sl, sl, #8 ; if ARM.N was set, set m68k.N
orreq sl, sl, #4 ; if ARM.Z was set, set m68k.Z
orrvs sl, sl, #2 ; if ARM.V was set, set m68k.V
orrcs sl, sl, #17 ; if ARM.C was set, set m68k.C and m68k.X
str r1, [fp]            ; m68k->D0 = r1
str r0, [fp, #4]     ; m68k->D1 = r0
str r2, [fp, #28]   ; m68k->D7 = r2
strh sl, [fp, #80] ; m68k->SR = sl
str ip, [fp, #76] ; m68k->PC = ip
bx lr
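The fp-relative offsets in this listing hint at how the m68k context is laid out in memory. The C struct below is only a guess that is consistent with those offsets: the struct name, the field names and the placement of USP, MSP and ISP in the 12-byte gap between A7 and PC are assumptions, not taken from the Emu68 sources.

```c
#include <stddef.h>
#include <stdint.h>

struct m68k_context {  /* hypothetical name and field order */
    uint32_t d[8];     /* D0..D7: offsets 0..28  (str r2, [fp, #28] is D7) */
    uint32_t a[8];     /* A0..A7: offsets 32..60 */
    uint32_t usp;      /* assumed: offsets 64, 68 and 72 hold the */
    uint32_t msp;      /* three stack pointers, in unknown order  */
    uint32_t isp;
    uint32_t pc;       /* offset 76 (ldr ip, [fp, #76]) */
    uint16_t sr;       /* offset 80 (ldrh sl, [fp, #80]) */
};

/* Compile-time checks against the offsets used by the generated ARM code. */
_Static_assert(offsetof(struct m68k_context, d) == 0, "D0 at offset 0");
_Static_assert(offsetof(struct m68k_context, pc) == 76, "PC at offset 76");
_Static_assert(offsetof(struct m68k_context, sr) == 80, "SR at offset 80");
```

With fp pointing at such a struct, every register in the context is reachable with a single load or store using an immediate offset, which is what the prologue and epilogue of the translated block do.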

Why the 24-bit shifting? Because I do not want to calculate the status flags myself; I prefer to let the ARM CPU do that for me. With both byte operands moved into the top 8 bits, the 32-bit adds instruction sets N, Z, C and V exactly as an 8-bit addition would. As you can see in the code, the first two instructions load PC and SR (the most frequently used, therefore always cached), then the m68k code begins. At the end of the translation unit, all modified (dirty) registers are stored back in the m68k context. Finally, PC and SR are stored too and the code returns to the JIT. Once the code returned, I printed out the m68k context:

D0 = 0xdeadbeef D1 = 0xdeadcafe D2 = 0x00000000 D3 = 0x00000000
D4 = 0x00000000 D5 = 0x00000000 D6 = 0x00000000 D7 = 0xdeadbedf
A0 = 0x00000000 A1 = 0x00000000 A2 = 0x00000000 A3 = 0x00000000
A4 = 0x00000000 A5 = 0x00000000 A6 = 0x00000000 A7 = 0x00000000
PC = 0x0008a0a2 SR = XN..C
USP= 0x00000000 MSP= 0x00000000 ISP= 0x00000000

As you can see, the registers are set properly and the CPU flags are correct, too.
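The same results can be reproduced in plain C, which is a handy way to cross-check the generated code. The sketch below mimics what the ARM sequence does for move.w and addi.b, including the trick of doing the byte addition in the top 8 bits of a 32-bit word. The helper names (`move_w`, `addi_b`, `ccr_string`) are made up for this example, and since C has no access to the ARM flags, carry and overflow are derived by hand here.

```c
#include <stdint.h>

/* m68k condition code bits: C=1, V=2, Z=4, N=8, X=16. */
enum { F_C = 1, F_V = 2, F_Z = 4, F_N = 8, F_X = 16 };

/* move.w #imm,Dn: replace the lower word, keep the upper 16 bits. */
static uint32_t move_w(uint32_t dn, uint16_t imm)
{
    return (dn & 0xffff0000u) | imm;
}

/* addi.b #imm,Dn: add in the top byte, so a 32-bit add behaves like an 8-bit one. */
static uint32_t addi_b(uint32_t dn, uint8_t imm, uint16_t *sr)
{
    uint32_t a = (uint32_t)imm << 24;
    uint32_t b = dn << 24;
    uint32_t sum = a + b; /* wraps modulo 2^32, like the ARM adds */
    uint16_t f = 0;

    if (sum & 0x80000000u) f |= F_N;
    if (sum == 0)          f |= F_Z;
    /* Overflow: both operands share a sign that the result does not. */
    if (~(a ^ b) & (a ^ sum) & 0x80000000u) f |= F_V;
    /* Carry out of the top bit; m68k ADDI copies it into X as well. */
    if (sum < a)           f |= F_C | F_X;

    *sr = (uint16_t)((*sr & ~31u) | f);
    return (dn & 0xffffff00u) | (sum >> 24);
}

/* Render the five condition code bits in the order X N Z V C. */
static void ccr_string(uint16_t sr, char out[6])
{
    static const char names[] = "XNZVC";
    for (int i = 0; i < 5; i++)
        out[i] = (sr & (16u >> i)) ? names[i] : '.';
    out[5] = '\0';
}
```

Starting from D0 = 0xdeadbeef, `move_w(d0, 0xcafe)` gives 0xdeadcafe and `addi_b(d0, 0xf0, &sr)` gives 0xdeadbedf with the flags rendered as XN..C, matching the context dump above.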

The example was pretty simple, but ones with much more complex addressing modes work too. Over the next few days I will implement more instructions and eventually start some benchmarking.

