N0RAMboot: A Zero RAM bootloader

Not so long ago made an XMODEM serial bootloader for the SEGA Master System (link) and while it works pretty well, it has a few restrictions like upload size (being written in C requires some RAM for working) and the uploaded ROM must be able to bootstrap and relocate itself within RAM once chainloaded. This works just fine to upload small programs to RAM but it feels bad not being able to use the full 8KB of RAM for the uploaded ROM. This would require the bootloader itself to use exactly Zero 0 bytes of RAM.

This project is already working! Download the latest ROM here! Source available at my source code repository.

Objectives

As described above, the objective is creating an XMODEM bootloader that uses zero bytes of RAM during its execution to allow uploading ROMs up to 8KB, the entire Master System RAM. In short:

Use 0 Bytes of RAM
Use 48K ROM at most (no mapper)
No cheating! VRAM used exclusively for Video
Written in pure Z80 Asm
Implement the XMODEM-CRC variant
Chainload cartridge/card if upload fails (BIOS replacement)

The side effects of these restrictions are:

No CALL RET instructions.
No interrupts (NMI will reset the console)
All function calls become inlined macros
Loop unrolls everywhere

Revision history

Version	Features
V1.3	Speed-up to 14400bps	Download
	On timeout, chainload other media
V1.2	Allow uploading data to VRAM	Download
	Recovery from transmission errors
V1.1	Speed-up to 9600bps	Download
V1.0	Successful XMODEM CRC upload 4800bps	Download

Future release features

Speed-up to 19200bps
Faster packet checks (less idle time)

Implementation

The following section will detail most of the implementation features and some challenges that have appeared.

Memory budget

While RAM will not be used, the Z80 has a healthy amount of registers at my disposal. A BC DE HL and their shadow counterparts will give me 14 bytes of storage, these are more or less “directly” accessible. The SP register will not be useful for its original purpose (no ram usage, no stack to point at) and can be moved back-and-forth to the HL register like this:

;SP <- HL
LD SP, HL

;HL <- SP
LD HL, #0x00
ADD HL, SP

This gives us two more bytes of memory budget, yielding a current total of 16 bytes of memory. Now, the IX and IY registers can be freely moved via undocumented opcodes as if they were the HL register. Prepending a 0xDD or a 0xFD byte to a LD opcode sourcing/targeting the H or L register, will convert them to IXH IXL IYH or IYL loads. For example:

;A <- IXL
.db 0xDD
LD A, L

;IXL <- A
.db 0xDD
LD L, A

Access to these increments our budget in 4 bytes, giving a grand total of 20 bytes of memory. So far so good, but just in case, let’s scrape the bottom of the memory barrel and gain an extra bit (yes bit) of memory to signal some binary event, like a success/fail return value. The obvious candidate is the carry flag C since it can be easily set-reset like this:

;Set carry flag
JR C, flag_set
CCF
flag_set:

;Clear carry flag
JR NC, flag_clear
CCF
flag_clear:

Otherwise it can be easily set/cleared while at the same time clearing A by doing:

;Clear both A and C
XOR A, A
;Set C
CCF

So our final memory budget is a grand total of: 20 bytes (+2 carry bits)

CRC-16

To check the integrity of each packet, XMODEM (CRC variant) uses a 16bit CRC algorithm. Since at most 32 packets will be needed for a successful transfer the prime target is wasting no registers, speed and ROM usage be damned… To an extent…

The reference C implementation uses either a bit-by-bit approach or a table based approach. The bit-by-bit implementation in Assembler managed to use only three (four bytes) registers, HL where the previous CRC is stored and where the next CRC value will be stored later, the A register, which contains the next byte, and the A' register for some temporal values. Nothing out of the ordinary, really.

For the table-based approach I wanted to trade off extra ROM usage and free up the A' register. That means the CRC implementation can only use the register HL occupied by the current/next CRC, and the A register, that contains the next byte. That means no indirect addressing nor calculated jumps since these would require the use of another HL register. The solution is less than pretty, but it works! And it ain’t too slow either.

The macro contains a crc16_word_table with 256 16-bit values, used for the CRC calculation. Then a macro that decrements the value of A, tests if it is zero, and if so apply the corresponding CRC operation for that bit.

;Macro crc16_test_byte(table_index)
;If A-- is zero, apply CRC.
DEC A
JR NZ, skip_byte

;if not, apply CRC
LD A, L
LD HL, (#crc_table + (2 * table_index))
XOR A, H
LD H, A

;Then clear A so no other subsequent macro
;corrupts the current CRC value
XOR A, A
skip_byte:

Cascading 256 instances of the macro, and incrementing the table_index value on each instance, we can calculate the CRC on zero extra registers.

;Current CRC on HL, next byte on A
;Calculate next table index using A
XOR A, H  ;Index stored on A

;Each macro call decrements A til it finds the correct
;table index without using any more regs :3
crc16_test_byte 1
crc16_test_byte 2
;[...]
crc16_test_byte 255
crc16_test_byte 0
;Next CRC on HL

Is not by any means pretty, and using this macro will add ~2KB of code to the ROM on each invocation but since it will only be needed on a single place, it can be justified.

Update: As of latest revision, CRC can be calculated bit wise using only the HL register, skipping the need of A' for temporary XOR constant. The critical part being calculation HL <- HL ^ 0x1021 which can be calculated using just RL and CCF as shown below.

;XOR HL with 0x1021 (0b0001000000100001)
;XOR H with 0001 0000
RL H ;Bit 7 ^ 0  (don't flip)
RL H ;Bit 6 ^ 0  (don't flip)
RL H ;Bit 5 ^ 0  (don't flip)
RL H ;Bit 4 ^ 1  (flip with CCF)
CCF

RL H ;Bit 3 ^ 0  (don't flip)
RL H ;Bit 2 ^ 0  (don't flip)
RL H ;Bit 1 ^ 0  (don't flip)
RL H ;Bit 0 ^ 0  (don't flip)

RL H ;Shift LSB stored in C

;XOR L with 0010 0001
RL L ;Bit 7 ^ 0 (don't flip)
RL L ;Bit 6 ^ 0 (don't flip)
RL L ;Bit 5 ^ 1 (flip with CCF)
CCF
RL L ;Bit 4 ^ 0 (don't flip)

RL L ;Bit 3 ^ 0 (don't flip)
RL L ;Bit 2 ^ 0 (don't flip)
RL L ;Bit 1 ^ 0 (don't flip)
RL L ;Bit 0 ^ 1 (flip with CCF)
CCF

RL L ;Shift LSB stored in C

UART transmission

The usual way to bit-bang UART is by making a delay loop. Grab some register, set a value, and waste some time running NOPs, however this solution is very “wasteful” in our use case, since registers and/or memory is something we do not have, hence the delay loop shall be unrolled.

The lower the baud rate, the bigger the unrolled delay loop will be but for compatibility with the previous bootloader, 4800bps will be the aim. For the delay loop itself we need a “safe” opcode, meaning no memory/register writes. The candidates are NOP CP BIT and direct JP opcodes.

Note: Compatibility not an issue any more, serial port now runs at 9600bps but the calculations on this section refer to the 4800bps version.

Opcode	Size	T-States	T-States/Byte
`JP nn`	3	10	3.3
`NOP`	1	4	4
`BIT b, (HL)`	2	12	6
`CP (HL)`	1	7	7

Initially using BIT b, (YX+n) (~6 T-States per byte) the UART test ROM weights 4490 bytes. Using CP (HL) + NOP it weights 3710 bytes. Each instance of the UART TX macro will use slightly above 1KB of ROM, making it quite expensive but not prohibitively so.

UART reception

Complementary to the transmission macros, the reception macro is marginally more tricky. Macro shall detect a low level on the RX line, wait for one and a half bit, then sample the 8 bits.

----------+     +-----+-----+-----+-----+-----+-----+-----+-----+-----
          |  S  |     |     |     |     |     |     |     |     |  S  
Detect st |  T  |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  T  
bit ----> |  A  |     |     |     |     |     |     |     |     |  O  
          +-----+-----+-----+-----+-----+-----+-----+-----+-----+     
Wait half    |     |     |     |     |     |     |     |     |     |
bit ---------^     |     |     |     |     |     |     |     |     |
             |     |     |     |     |     |     |     |     |     |
Wait 1 bit & |     |     |     |     |     |     |     |     |     |
sample ------^-----^-----^-----^-----^-----^-----^-----^-----^-----^

Macro returns the read value on A and sets C on success. If the start bit cannot be detected, macro returns error by clearing C bit. Like the UART TX macro, every instance of this will cost you ~1KB of ROM. Register usage is the A register and the D register.

After some tuning, the sampling on the logic analyzer looked pretty good.

RX sampling

XMODEM implementation

At its barest minimum, XMODEM needs to keep track of the current packet, the current byte, and a pointer to the current RAM address being written. For this purpose the BC register is reserved for Byte and packet state, and for the memory pointer, the IX register is being used. For the rest of “State” the program flow itself and the use-for-everything registers AF and HL are enough.

The basic XMODEM (CRC) flow is as follows.

XMODEM Flowchart

Like it was said, a state machine controlling this flow needs essentially no memory. The exceptional cases not listed on the diagram would be trying to load a packet beyond the 8KB mark, on such case the bootloader should stop. And the event where the transfer never starts, on such case the bootloader may either stop or try to load the next media, essentially acting like a BIOS replacement.

VRAM

As of version V1.2, the bootloader will accept ROMs of up to 24K to allow uploading VRAM data together with the program ROM. In this way, the first 8K will be uploaded to the console’s RAM and up to 16K of data following the program ROM will be uploaded to the VRAM, saving on program ROM usage.

The ROM format with VRAM data is as follows.

Offset   Description
0x0000 +-------------+
       | Program ROM |
0x1FFF +-------------+  <-- Up to 8K loaded to RAM
0x2000 +-------------+  <-- If >8K, data is loaded to VRAM
       |             |
       |  VRAM data  |
       |             |
0x5FFF +-------------+  <-- Up to a maximum of 24K

Performance

With the implementation completed, the throughput is ~785 Bytes/s, including XMODEM overhead, the Master System checking the CRC, and the PC/SMS latency. In the end it gives around an 82% channel usage from a 960 Bytes/s maximum capacity of the serial port running at 9600bps.

Throughput

For the curious, here is a complete XMODEM transmission under the logic analyzer. At any given time there is only one outstanding byte on both lines, making implementation quite easy.

Complete XMODEM transmission

A side effect of XMODEM being a lockstep protocol is that it will steadily lose efficiency the faster the baud rate is, since the sender/receiver latency and the time it takes the receiver to validate a packet remains constant even as the baud rate increases. In our case, latency + validation takes 21.5ms and the top efficiency at different baud rates would be:

Baud rate	TX time (24K)	Efficiency	Gain
9600	31s	86%	N/A
14400	22s	81%	30%
19200	18s	76%	18%

Videos

June 2021 update: Additional videos available here.

Simple “Hello world!”

Video of the V1.2 ROM loading a sample program at 9600bps. Download as MP4 file.

Program with VRAM data

Video of the V1.2 ROM loading a ROM with VRAM data preloaded. Download video as MP4 file. Sample ROM available on the source repository.