Header

Not so long ago made an XMODEM serial bootloader for the SEGA Master System (link) and while it works pretty well, it has a few restrictions like ROM size (being written mostly in C requires some RAM for working) and the uploaded ROM being able to bootstrap and relocate itself into memory once chainloaded. This works just fine to upload small programs to RAM but it feels bad not being able to use the full 8KB of RAM for the uploaded ROM. This would require the bootloader itself to use exactly Zero 0 bytes of RAM.

This project is already working! Download the latest ROM here! Source available at my BitBucket repository.

Objectives

As described above, the objective is creating an XMODEM bootloader that uses zero bytes of RAM during its execution as to allow uploads ROMs up to 8KB, the entire Master System RAM. In short:

  • Use 0 Bytes of RAM
  • Use 48K ROM at most (no mapper)
  • No cheating! VRAM used exclusively for Video
  • Written in pure Z80 Asm
  • Implement the XMODEM-CRC variant
  • Chainload cartridge/card if upload fails (BIOS replacement)

The side effects of these restrictions are obvious:

  • No CALL RET instructions.
  • No interrupts (NMI will reset the console)
  • All function calls become inlined macros
  • Loop unrolls everywhere

Current state

  • Succesful XMODEM CRC bootload (V1.0)
  • Speed-up to 9600bps (V1.1)
  • Recovery from faulty transmission
  • Chainload other media

Implementation

The following section will detail most of the implementation features and some challenges that have appeared.

Memory budget

While RAM will not be used, the Z80 has a healthy ammount of registers at my disposal. A BC DE HL and their shadow counterparts will give me 14 bytes of storage, these are more or less “directly” accesible. The SP register will not be useful for its original purpose (no ram usage, no stack to point at) and can be moved back-and-forth to the HL register like this:

;SP <- HL
LD SP, HL
;HL <- SP
LD HL, #0x00
ADD HL, SP

This gives us two more bytes of memory budget, giving a current total of 16 bytes of memory. Now, the IX and IY registers can be freely moved via undocumented opcodes as if they were the HL register. Prepending a 0xDD or a 0xFD byte to a LD opcode sourcing/targetting the H or L register, will convert them to IXH IXL IYH or IYL loads. For example:

;A <- IXL
.db 0xDD
LD A, L
;IXL <- A
.db 0xDD
LD L, A

Access to these increments our budget in 4 bytes, giving a grand total of 20 bytes of memory. So far so good, but just in case, let’s scrape the bottom of the memory barrel and gain an extra bit (yes bit) of memory to signal some binary event, like a success/fail return value. The obvious candidate is the carry flag C since it can be easily set-reset like this:

;Set carry flag
JR C, flag_set
CCF
flag_set:
;Clear carry flag
JR NC, flag_clear
CCF
flag_clear:

Otherwise it can be easily set/cleared while at the same time clearing A by doing:

;Clear both A and C
XOR A, A
;Set C
CCF

So our final memory budget is up to: 20 bytes (+2 carry bits)

CRC-16

To check the integrity of each packet, XMODEM (CRC variant) uses a 16bit CRC algorithm. Since at most 32 packets will be needed for a successful transfer the prime target is wasting no registers, speed and ROM usage be damned… To an extent…

The reference C implementation uses either a bit-by-bit approach or a table based approach. The bit-by-bit implementation in Assembler managed to use only three (four) registers, HL where the previous CRC is stored and where the next CRC value will be stored later, the A register, which contains the next byte, and the A' register for some minor tests. Nothing out of the ordinary, really.

For the table-ased approach I wanted to trade off extra ROM usage and free up the A' register. That means the CRC implementation can only use the register HL occupied by the current/next CRC, and the A register, that contains the next byte. That means no indirect addressing nor calculated jumps since these would require the use of another HLregister. The solution is less than pretty, but it works! And it ain’t too slow either.

The macro contains a crc16_word_table with 256 16-bit values, used for the CRC calculation. Then a macro that decrements the value of A, tests if it is zero, and if so apply the corresponding CRC operation for that bit.

;Macro crc16_test_byte(table_index)
;If A-- is zero, apply CRC.
DEC A
JR NZ, skip_byte

;if not, aplly CRC
LD A, L
LD HL, (#crc_table + (2 * table_index))
XOR A, H
LD H, A

;Then clear A so no other subsequent macro
;corrupts the current CRC value
XOR A, A
skip_byte:

Cascading 256 instances of the macro, and incrementing the table_index value on each instance, we can calculate the CRC on zero extra registers.

;Current CRC on HL, next byte on A
;Calculate next table index using A
XOR A, H  ;Index stored on A

;Each macro call decrements A til it finds the correct
;table index without using any more regs :3
crc16_test_byte 1
crc16_test_byte 2
[...]
crc16_test_byte 255
crc16_test_byte 0
;Next CRC on HL

Is not by any means pretty, and using this macro will add ~2KB of code to the ROM on each invocation but since it will only be needed on a single place, it can be justified.

UART transmission

The usual way to bit-bang UART is by making a delay loop. Grab some register, set a value, and waste some time running NOPs, however this solution is very “wasteful” in this use case, since registers and/or memory is something we do not have, hence the delay loop shall be unrolled.

The lower the baudrate, the bigger the unrolled delay loop will be but for compatibility with the previous bootloader, 4800bps will be the aim. For the delay loop itsels we need a “safe” opcode, meaning no memory/register writes. The candidates are NOP CP BIT and direct JP opcodes.

Note: Compatibility not an issue any more, serial port now runs at 9600bps.

Opcode Size T-States T-States/Byte
JP nn 3 10 3.3
NOP 1 4 4
BIT b, (HL) 2 12 6
CP (HL) 1 7 7
       

Initially using BIT b, (YX+n) (~6 T-States per byte) the UART test ROM weights 4490 bytes. Using CP (HL) + NOP it weights 3710 bytes. Each instance of the UART TX macro will use slightly above 1KB of ROM, making it quite expensive but not prohibitely so.

UART reception

Complementary to the transmission macros, the reception macro is marginally more tricky. Macro shall detect a low level on the RX line, wait for one and a half bit, then sample the 8 bits.

----------+     +-----+-----+-----+-----+-----+-----+-----+-----+-----
          |  S  |     |     |     |     |     |     |     |     |  S  
Detect st |  T  |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  T  
bit ----> |  A  |     |     |     |     |     |     |     |     |  O  
          +-----+-----+-----+-----+-----+-----+-----+-----+-----+     
Wait half    |     |     |     |     |     |     |     |     |     |
bit ---------^     |     |     |     |     |     |     |     |     |
             |     |     |     |     |     |     |     |     |     |
Wait 1 bit & |     |     |     |     |     |     |     |     |     |
sample ------^-----^-----^-----^-----^-----^-----^-----^-----^-----^

Macro returns the read value on A and sets C on success. If the start bit cannot be detected, macro returns error by clearing C bit. Like the UART TX macro, every instance of this will cost you ~1KB of ROM. Register usage is the A register and the D register.

After some tuning, the sampling on the logic analyzer looked pretty good.

rx sampling

XMODEM implementation

At its barest minimu, XMODEM needs to keep track of the current packet, the current byte, and a pointer to the current RAM address being written. For this purpose the BC register is reserved for Byte and paCket state, and for the memory pointer, the IX register is being used. For the rest of “State” the program flow itself and the use-for-everything registers AF and HL are enough.

The basic XMODEM (CRC) flow is as follows.

XMODEM Flowchart

Like it was said, a state machine controling this flow needs essentially no memory. The exceptional cases not listed on the diagram would be trying to load a packet beyond the 8KB mark, on such case the bootloaders should stop. And the event where the transfer never starts, on such case the bootloader may either stop or try to load the next media, essentially acting like a BIOS replacement.

Performance

With the implementation completed, the throughput is ~785 Bytes/s, including XMODEM overhead, the Master System checking the CRC, and the PC/SMS latency. In the end it gives sround an 82% channel usage from a 960 Bytes/s maximum capacity.

Througput

For the curious, here is a complete XMODEM transmission under the logic analyzer. At any given time there is only one outstanding byte on both lines, making implementation quite easy.

Complete XMODEM transmission

Working Video

Video of the first working ROM, at 4800bps. Download as MP4 file.

More to come! With videos!

Stay tuned for further updates!