Not so long ago made an XMODEM serial bootloader for the SEGA Master System
(link) and while
it works pretty well, it has a few restrictions like upload size (being
written in C requires some RAM for working) and the uploaded ROM must be
able to bootstrap and relocate itself within RAM once chainloaded. This works
just fine to upload small programs to RAM but it feels bad not being able to
use the full 8KB of RAM for the uploaded ROM. This would require the
bootloader itself to use exactly Zero
0 bytes of RAM.
As described above, the objective is creating an XMODEM bootloader that uses zero bytes of RAM during its execution to allow uploading ROMs up to 8KB, the entire Master System RAM. In short:
- Use 0 Bytes of RAM
- Use 48K ROM at most (no mapper)
- No cheating! VRAM used exclusively for Video
- Written in pure Z80 Asm
- Implement the XMODEM-CRC variant
- Chainload cartridge/card if upload fails (BIOS replacement)
The side effects of these restrictions are:
- No interrupts (
NMIwill reset the console)
- All function calls become inlined macros
- Loop unrolls everywhere
|V1.3||Speed-up to 14400bps||Download|
|On timeout, chainload other media|
|V1.2||Allow uploading data to VRAM||Download|
|Recovery from transmission errors|
|V1.1||Speed-up to 9600bps||Download|
|V1.0||Successful XMODEM CRC upload 4800bps||Download|
Future release features
- Speed-up to 19200bps
- Faster packet checks (less idle time)
The following section will detail most of the implementation features and some challenges that have appeared.
While RAM will not be used, the Z80 has a healthy amount of registers at my
HL and their shadow counterparts will give me 14
bytes of storage, these are more or less “directly” accessible. The
register will not be useful for its original purpose (no ram usage, no stack to
point at) and can be moved back-and-forth to the
HL register like this:
;SP <- HL LD SP, HL
;HL <- SP LD HL, #0x00 ADD HL, SP
This gives us two more bytes of memory budget, yielding a current total of 16
bytes of memory. Now, the
IY registers can be freely moved via
undocumented opcodes as if they were the
HL register. Prepending a
0xFD byte to a
LD opcode sourcing/targeting the
will convert them to
IYL loads. For example:
;A <- IXL .db 0xDD LD A, L
;IXL <- A .db 0xDD LD L, A
Access to these increments our budget in 4 bytes, giving a grand total of
20 bytes of memory. So far so good, but just in case, let’s scrape the bottom
of the memory barrel and gain an extra bit (yes bit) of memory to signal some
binary event, like a success/fail return value. The obvious candidate is the
C since it can be easily set-reset like this:
;Set carry flag JR C, flag_set CCF flag_set:
;Clear carry flag JR NC, flag_clear CCF flag_clear:
Otherwise it can be easily set/cleared while at the same time clearing
;Clear both A and C XOR A, A ;Set C CCF
So our final memory budget is a grand total of: 20 bytes (+2 carry bits)
To check the integrity of each packet, XMODEM (CRC variant) uses a 16bit CRC algorithm. Since at most 32 packets will be needed for a successful transfer the prime target is wasting no registers, speed and ROM usage be damned… To an extent…
The reference C implementation uses either a bit-by-bit approach or a table
based approach. The bit-by-bit implementation in Assembler managed to use only
three (four bytes) registers,
HL where the previous CRC is stored and where
the next CRC value will be stored later, the
A register, which contains the
next byte, and the
A' register for some temporal values. Nothing out of the
For the table-based approach I wanted to trade off extra ROM usage and free up
A' register. That means the CRC implementation can only use the
HL occupied by the current/next CRC, and the
A register, that
contains the next byte. That means no indirect addressing nor calculated
jumps since these would require the use of another
HL register. The solution
is less than pretty, but it works! And it ain’t too slow either.
The macro contains a
crc16_word_table with 256 16-bit values, used for the
CRC calculation. Then a macro that decrements the value of
A, tests if it is
zero, and if so apply the corresponding CRC operation for that bit.
;Macro crc16_test_byte(table_index) ;If A-- is zero, apply CRC. DEC A JR NZ, skip_byte ;if not, apply CRC LD A, L LD HL, (#crc_table + (2 * table_index)) XOR A, H LD H, A ;Then clear A so no other subsequent macro ;corrupts the current CRC value XOR A, A skip_byte:
Cascading 256 instances of the macro, and incrementing the
on each instance, we can calculate the CRC on zero extra registers.
;Current CRC on HL, next byte on A ;Calculate next table index using A XOR A, H ;Index stored on A ;Each macro call decrements A til it finds the correct ;table index without using any more regs :3 crc16_test_byte 1 crc16_test_byte 2 ;[...] crc16_test_byte 255 crc16_test_byte 0 ;Next CRC on HL
Is not by any means pretty, and using this macro will add ~2KB of code to the ROM on each invocation but since it will only be needed on a single place, it can be justified.
Update: As of latest revision, CRC can be calculated bit wise using only the
HL register, skipping the need of
A' for temporary XOR constant. The
critical part being calculation
HL <- HL ^ 0x1021 which can be calculated
CCF as shown below.
;XOR HL with 0x1021 (0b0001000000100001) ;XOR H with 0001 0000 RL H ;Bit 7 ^ 0 (don't flip) RL H ;Bit 6 ^ 0 (don't flip) RL H ;Bit 5 ^ 0 (don't flip) RL H ;Bit 4 ^ 1 (flip with CCF) CCF RL H ;Bit 3 ^ 0 (don't flip) RL H ;Bit 2 ^ 0 (don't flip) RL H ;Bit 1 ^ 0 (don't flip) RL H ;Bit 0 ^ 0 (don't flip) RL H ;Shift LSB stored in C ;XOR L with 0010 0001 RL L ;Bit 7 ^ 0 (don't flip) RL L ;Bit 6 ^ 0 (don't flip) RL L ;Bit 5 ^ 1 (flip with CCF) CCF RL L ;Bit 4 ^ 0 (don't flip) RL L ;Bit 3 ^ 0 (don't flip) RL L ;Bit 2 ^ 0 (don't flip) RL L ;Bit 1 ^ 0 (don't flip) RL L ;Bit 0 ^ 1 (flip with CCF) CCF RL L ;Shift LSB stored in C
The usual way to bit-bang UART is by making a delay loop. Grab some register, set a value, and waste some time running NOPs, however this solution is very “wasteful” in our use case, since registers and/or memory is something we do not have, hence the delay loop shall be unrolled.
The lower the baud rate, the bigger the unrolled delay loop will be but for
compatibility with the previous bootloader, 4800bps will be the aim. For the
delay loop itself we need a “safe” opcode, meaning no memory/register writes.
The candidates are
BIT and direct
Note: Compatibility not an issue any more, serial port now runs at 9600bps but the calculations on this section refer to the 4800bps version.
BIT b, (YX+n) (~6 T-States per byte) the UART test ROM
weights 4490 bytes. Using
CP (HL) +
NOP it weights 3710 bytes. Each
instance of the UART TX macro will use slightly above 1KB of ROM, making
it quite expensive but not prohibitively so.
Complementary to the transmission macros, the reception macro is marginally more tricky. Macro shall detect a low level on the RX line, wait for one and a half bit, then sample the 8 bits.
----------+ +-----+-----+-----+-----+-----+-----+-----+-----+----- | S | | | | | | | | | S Detect st | T | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | T bit ----> | A | | | | | | | | | O +-----+-----+-----+-----+-----+-----+-----+-----+-----+ Wait half | | | | | | | | | | bit ---------^ | | | | | | | | | | | | | | | | | | | Wait 1 bit & | | | | | | | | | | sample ------^-----^-----^-----^-----^-----^-----^-----^-----^-----^
Macro returns the read value on
A and sets
C on success. If the start bit
cannot be detected, macro returns error by clearing
C bit. Like the UART TX
macro, every instance of this will cost you ~1KB of ROM. Register usage is the
A register and the
After some tuning, the sampling on the logic analyzer looked pretty good.
At its barest minimum, XMODEM needs to keep track of the current packet, the
current byte, and a pointer to the current RAM address being written. For this
BC register is reserved for Byte and packet state, and for the
memory pointer, the
IX register is being used. For the rest of “State” the
program flow itself and the use-for-everything registers
The basic XMODEM (CRC) flow is as follows.
Like it was said, a state machine controlling this flow needs essentially no memory. The exceptional cases not listed on the diagram would be trying to load a packet beyond the 8KB mark, on such case the bootloader should stop. And the event where the transfer never starts, on such case the bootloader may either stop or try to load the next media, essentially acting like a BIOS replacement.
As of version V1.2, the bootloader will accept ROMs of up to 24K to allow uploading VRAM data together with the program ROM. In this way, the first 8K will be uploaded to the console’s RAM and up to 16K of data following the program ROM will be uploaded to the VRAM, saving on program ROM usage.
The ROM format with VRAM data is as follows.
Offset Description 0x0000 +-------------+ | Program ROM | 0x1FFF +-------------+ <-- Up to 8K loaded to RAM 0x2000 +-------------+ <-- If >8K, data is loaded to VRAM | | | VRAM data | | | 0x5FFF +-------------+ <-- Up to a maximum of 24K
With the implementation completed, the throughput is ~785 Bytes/s, including XMODEM overhead, the Master System checking the CRC, and the PC/SMS latency. In the end it gives around an 82% channel usage from a 960 Bytes/s maximum capacity of the serial port running at 9600bps.
For the curious, here is a complete XMODEM transmission under the logic analyzer. At any given time there is only one outstanding byte on both lines, making implementation quite easy.
A side effect of XMODEM being a lockstep protocol is that it will steadily lose efficiency the faster the baud rate is, since the sender/receiver latency and the time it takes the receiver to validate a packet remains constant even as the baud rate increases. In our case, latency + validation takes 21.5ms and the top efficiency at different baud rates would be:
|Baud rate||TX time (24K)||Efficiency||Gain|
June 2021 update: Additional videos available here.
Simple “Hello world!”
Video of the V1.2 ROM loading a sample program at 9600bps. Download as MP4 file.