N0RAMboot: A Zero RAM bootloader
Not so long ago made an XMODEM serial bootloader for the SEGA Master System
(link) and while
it works pretty well, it has a few restrictions like upload size (being
written in C requires some RAM for working) and the uploaded ROM must be
able to bootstrap and relocate itself within RAM once chainloaded. This works
just fine to upload small programs to RAM but it feels bad not being able to
use the full 8KB of RAM for the uploaded ROM. This would require the
bootloader itself to use exactly Zero 0
bytes of RAM.
This project is already working! Download the latest ROM here! Source available at my source code repository.
Objectives
As described above, the objective is creating an XMODEM bootloader that uses zero bytes of RAM during its execution to allow uploading ROMs up to 8KB, the entire Master System RAM. In short:
- Use 0 Bytes of RAM
- Use 48K ROM at most (no mapper)
- No cheating! VRAM used exclusively for Video
- Written in pure Z80 Asm
- Implement the XMODEM-CRC variant
- Chainload cartridge/card if upload fails (BIOS replacement)
The side effects of these restrictions are:
- No
CALL
RET
instructions. - No interrupts (
NMI
will reset the console) - All function calls become inlined macros
- Loop unrolls everywhere
Revision history
Version | Features | |
---|---|---|
V1.3 | Speed-up to 14400bps | Download |
On timeout, chainload other media | ||
V1.2 | Allow uploading data to VRAM | Download |
Recovery from transmission errors | ||
V1.1 | Speed-up to 9600bps | Download |
V1.0 | Successful XMODEM CRC upload 4800bps | Download |
Future release features
- Speed-up to 19200bps
- Faster packet checks (less idle time)
Implementation
The following section will detail most of the implementation features and some challenges that have appeared.
Memory budget
While RAM will not be used, the Z80 has a healthy amount of registers at my
disposal. A
BC
DE
HL
and their shadow counterparts will give me 14
bytes of storage, these are more or less “directly” accessible. The SP
register will not be useful for its original purpose (no ram usage, no stack to
point at) and can be moved back-and-forth to the HL
register like this:
;SP <- HL
LD SP, HL
;HL <- SP
LD HL, #0x00
ADD HL, SP
This gives us two more bytes of memory budget, yielding a current total of 16
bytes of memory. Now, the IX
and IY
registers can be freely moved via
undocumented opcodes as if they were the HL
register. Prepending a 0xDD
or a 0xFD
byte to a LD
opcode sourcing/targeting the H
or L
register,
will convert them to IXH
IXL
IYH
or IYL
loads. For example:
;A <- IXL
.db 0xDD
LD A, L
;IXL <- A
.db 0xDD
LD L, A
Access to these increments our budget in 4 bytes, giving a grand total of
20 bytes of memory. So far so good, but just in case, let’s scrape the bottom
of the memory barrel and gain an extra bit (yes bit) of memory to signal some
binary event, like a success/fail return value. The obvious candidate is the
carry flag C
since it can be easily set-reset like this:
;Set carry flag
JR C, flag_set
CCF
flag_set:
;Clear carry flag
JR NC, flag_clear
CCF
flag_clear:
Otherwise it can be easily set/cleared while at the same time clearing A
by doing:
;Clear both A and C
XOR A, A
;Set C
CCF
So our final memory budget is a grand total of: 20 bytes (+2 carry bits)
CRC-16
To check the integrity of each packet, XMODEM (CRC variant) uses a 16bit CRC algorithm. Since at most 32 packets will be needed for a successful transfer the prime target is wasting no registers, speed and ROM usage be damned… To an extent…
The reference C implementation uses either a bit-by-bit approach or a table
based approach. The bit-by-bit implementation in Assembler managed to use only
three (four bytes) registers, HL
where the previous CRC is stored and where
the next CRC value will be stored later, the A
register, which contains the
next byte, and the A'
register for some temporal values. Nothing out of the
ordinary, really.
For the table-based approach I wanted to trade off extra ROM usage and free up
the A'
register. That means the CRC implementation can only use the
register HL
occupied by the current/next CRC, and the A
register, that
contains the next byte. That means no indirect addressing nor calculated
jumps since these would require the use of another HL
register. The solution
is less than pretty, but it works! And it ain’t too slow either.
The macro contains a crc16_word_table
with 256 16-bit values, used for the
CRC calculation. Then a macro that decrements the value of A
, tests if it is
zero, and if so apply the corresponding CRC operation for that bit.
;Macro crc16_test_byte(table_index)
;If A-- is zero, apply CRC.
DEC A
JR NZ, skip_byte
;if not, apply CRC
LD A, L
LD HL, (#crc_table + (2 * table_index))
XOR A, H
LD H, A
;Then clear A so no other subsequent macro
;corrupts the current CRC value
XOR A, A
skip_byte:
Cascading 256 instances of the macro, and incrementing the table_index
value
on each instance, we can calculate the CRC on zero extra registers.
;Current CRC on HL, next byte on A
;Calculate next table index using A
XOR A, H ;Index stored on A
;Each macro call decrements A til it finds the correct
;table index without using any more regs :3
crc16_test_byte 1
crc16_test_byte 2
;[...]
crc16_test_byte 255
crc16_test_byte 0
;Next CRC on HL
Is not by any means pretty, and using this macro will add ~2KB of code to the ROM on each invocation but since it will only be needed on a single place, it can be justified.
Update: As of latest revision, CRC can be calculated bit wise using only the
HL
register, skipping the need of A'
for temporary XOR constant. The
critical part being calculation HL <- HL ^ 0x1021
which can be calculated
using just RL
and CCF
as shown below.
;XOR HL with 0x1021 (0b0001000000100001)
;XOR H with 0001 0000
RL H ;Bit 7 ^ 0 (don't flip)
RL H ;Bit 6 ^ 0 (don't flip)
RL H ;Bit 5 ^ 0 (don't flip)
RL H ;Bit 4 ^ 1 (flip with CCF)
CCF
RL H ;Bit 3 ^ 0 (don't flip)
RL H ;Bit 2 ^ 0 (don't flip)
RL H ;Bit 1 ^ 0 (don't flip)
RL H ;Bit 0 ^ 0 (don't flip)
RL H ;Shift LSB stored in C
;XOR L with 0010 0001
RL L ;Bit 7 ^ 0 (don't flip)
RL L ;Bit 6 ^ 0 (don't flip)
RL L ;Bit 5 ^ 1 (flip with CCF)
CCF
RL L ;Bit 4 ^ 0 (don't flip)
RL L ;Bit 3 ^ 0 (don't flip)
RL L ;Bit 2 ^ 0 (don't flip)
RL L ;Bit 1 ^ 0 (don't flip)
RL L ;Bit 0 ^ 1 (flip with CCF)
CCF
RL L ;Shift LSB stored in C
UART transmission
The usual way to bit-bang UART is by making a delay loop. Grab some register, set a value, and waste some time running NOPs, however this solution is very “wasteful” in our use case, since registers and/or memory is something we do not have, hence the delay loop shall be unrolled.
The lower the baud rate, the bigger the unrolled delay loop will be but for
compatibility with the previous bootloader, 4800bps will be the aim. For the
delay loop itself we need a “safe” opcode, meaning no memory/register writes.
The candidates are NOP
CP
BIT
and direct JP
opcodes.
Note: Compatibility not an issue any more, serial port now runs at 9600bps but the calculations on this section refer to the 4800bps version.
Opcode | Size | T-States | T-States/Byte |
---|---|---|---|
JP nn |
3 | 10 | 3.3 |
NOP |
1 | 4 | 4 |
BIT b, (HL) |
2 | 12 | 6 |
CP (HL) |
1 | 7 | 7 |
Initially using BIT b, (YX+n)
(~6 T-States per byte) the UART test ROM
weights 4490 bytes. Using CP (HL)
+ NOP
it weights 3710 bytes. Each
instance of the UART TX macro will use slightly above 1KB of ROM, making
it quite expensive but not prohibitively so.
UART reception
Complementary to the transmission macros, the reception macro is marginally more tricky. Macro shall detect a low level on the RX line, wait for one and a half bit, then sample the 8 bits.
----------+ +-----+-----+-----+-----+-----+-----+-----+-----+-----
| S | | | | | | | | | S
Detect st | T | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | T
bit ----> | A | | | | | | | | | O
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
Wait half | | | | | | | | | |
bit ---------^ | | | | | | | | |
| | | | | | | | | |
Wait 1 bit & | | | | | | | | | |
sample ------^-----^-----^-----^-----^-----^-----^-----^-----^-----^
Macro returns the read value on A
and sets C
on success. If the start bit
cannot be detected, macro returns error by clearing C
bit. Like the UART TX
macro, every instance of this will cost you ~1KB of ROM. Register usage is the
A
register and the D
register.
After some tuning, the sampling on the logic analyzer looked pretty good.
XMODEM implementation
At its barest minimum, XMODEM needs to keep track of the current packet, the
current byte, and a pointer to the current RAM address being written. For this
purpose the BC
register is reserved for Byte and packet state, and for the
memory pointer, the IX
register is being used. For the rest of “State” the
program flow itself and the use-for-everything registers AF
and HL
are
enough.
The basic XMODEM (CRC) flow is as follows.
Like it was said, a state machine controlling this flow needs essentially no memory. The exceptional cases not listed on the diagram would be trying to load a packet beyond the 8KB mark, on such case the bootloader should stop. And the event where the transfer never starts, on such case the bootloader may either stop or try to load the next media, essentially acting like a BIOS replacement.
VRAM
As of version V1.2, the bootloader will accept ROMs of up to 24K to allow uploading VRAM data together with the program ROM. In this way, the first 8K will be uploaded to the console’s RAM and up to 16K of data following the program ROM will be uploaded to the VRAM, saving on program ROM usage.
The ROM format with VRAM data is as follows.
Offset Description
0x0000 +-------------+
| Program ROM |
0x1FFF +-------------+ <-- Up to 8K loaded to RAM
0x2000 +-------------+ <-- If >8K, data is loaded to VRAM
| |
| VRAM data |
| |
0x5FFF +-------------+ <-- Up to a maximum of 24K
Performance
With the implementation completed, the throughput is ~785 Bytes/s, including XMODEM overhead, the Master System checking the CRC, and the PC/SMS latency. In the end it gives around an 82% channel usage from a 960 Bytes/s maximum capacity of the serial port running at 9600bps.
For the curious, here is a complete XMODEM transmission under the logic analyzer. At any given time there is only one outstanding byte on both lines, making implementation quite easy.
A side effect of XMODEM being a lockstep protocol is that it will steadily lose efficiency the faster the baud rate is, since the sender/receiver latency and the time it takes the receiver to validate a packet remains constant even as the baud rate increases. In our case, latency + validation takes 21.5ms and the top efficiency at different baud rates would be:
Baud rate | TX time (24K) | Efficiency | Gain |
---|---|---|---|
9600 | 31s | 86% | N/A |
14400 | 22s | 81% | 30% |
19200 | 18s | 76% | 18% |
Videos
June 2021 update: Additional videos available here.
Simple “Hello world!”
Video of the V1.2 ROM loading a sample program at 9600bps. Download as MP4 file.
Program with VRAM data
Video of the V1.2 ROM loading a ROM with VRAM data preloaded. Download video as MP4 file. Sample ROM available on the source repository.