Not so long ago made an XMODEM serial bootloader for the SEGA Master System
(link) and while
it works pretty well, it has a few restrictions like upload size (being
written mostly in C requires some RAM for working) and the uploaded ROM being
able to bootstrap and relocate itself into memory once chainloaded. This works
just fine to upload small programs to RAM but it feels bad not being able to
use the full 8KB of RAM for the uploaded ROM. This would require the
bootloader itself to use exactly Zero
0 bytes of RAM.
As described above, the objective is creating an XMODEM bootloader that uses zero bytes of RAM during its execution to allow uploading ROMs up to 8KB, the entire Master System RAM. In short:
- Use 0 Bytes of RAM
- Use 48K ROM at most (no mapper)
- No cheating! VRAM used exclusively for Video
- Written in pure Z80 Asm
- Implement the XMODEM-CRC variant
- Chainload cartridge/card if upload fails (BIOS replacement)
The side effects of these restrictions are:
- No interrupts (
NMIwill reset the console)
- All function calls become inlined macros
- Loop unrolls everywhere
- Succesful XMODEM CRC upload (V1.0)
- Speed-up to 9600bps (V1.1)
Recovery from faulty transmission Chainload other media
The following section will detail most of the implementation features and some challenges that have appeared.
While RAM will not be used, the Z80 has a healthy ammount of registers at my
HL and their shadow counterparts will give me 14
bytes of storage, these are more or less “directly” accesible. The
register will not be useful for its original purpose (no ram usage, no stack to
point at) and can be moved back-and-forth to the
HL register like this:
;SP <- HL LD SP, HL
;HL <- SP LD HL, #0x00 ADD HL, SP
This gives us two more bytes of memory budget, yielding a current total of 16
bytes of memory. Now, the
IY registers can be freely moved via
undocumented opcodes as if they were the
HL register. Prepending a
0xFD byte to a
LD opcode sourcing/targetting the
will convert them to
IYL loads. For example:
;A <- IXL .db 0xDD LD A, L
;IXL <- A .db 0xDD LD L, A
Access to these increments our budget in 4 bytes, giving a grand total of
20 bytes of memory. So far so good, but just in case, let’s scrape the bottom
of the memory barrel and gain an extra bit (yes bit) of memory to signal some
binary event, like a success/fail return value. The obvious candidate is the
C since it can be easily set-reset like this:
;Set carry flag JR C, flag_set CCF flag_set:
;Clear carry flag JR NC, flag_clear CCF flag_clear:
Otherwise it can be easily set/cleared while at the same time clearing
;Clear both A and C XOR A, A ;Set C CCF
So our final memory budget is a grand total of: 20 bytes (+2 carry bits)
To check the integrity of each packet, XMODEM (CRC variant) uses a 16bit CRC algorithm. Since at most 32 packets will be needed for a successful transfer the prime target is wasting no registers, speed and ROM usage be damned… To an extent…
The reference C implementation uses either a bit-by-bit approach or a table
based approach. The bit-by-bit implementation in Assembler managed to use only
three (four bytes) registers,
HL where the previous CRC is stored and where
the next CRC value will be stored later, the
A register, which contains the
next byte, and the
A' register for some minor tests. Nothing out of the
For the table-based approach I wanted to trade off extra ROM usage and free up
A' register. That means the CRC implementation can only use the
HL occupied by the current/next CRC, and the
A register, that
contains the next byte. That means no indirect addressing nor calculated
jumps since these would require the use of another
HLregister. The solution
is less than pretty, but it works! And it ain’t too slow either.
The macro contains a
crc16_word_table with 256 16-bit values, used for the
CRC calculation. Then a macro that decrements the value of
A, tests if it is
zero, and if so apply the corresponding CRC operation for that bit.
;Macro crc16_test_byte(table_index) ;If A-- is zero, apply CRC. DEC A JR NZ, skip_byte ;if not, aplly CRC LD A, L LD HL, (#crc_table + (2 * table_index)) XOR A, H LD H, A ;Then clear A so no other subsequent macro ;corrupts the current CRC value XOR A, A skip_byte:
Cascading 256 instances of the macro, and incrementing the
on each instance, we can calculate the CRC on zero extra registers.
;Current CRC on HL, next byte on A ;Calculate next table index using A XOR A, H ;Index stored on A ;Each macro call decrements A til it finds the correct ;table index without using any more regs :3 crc16_test_byte 1 crc16_test_byte 2 [...] crc16_test_byte 255 crc16_test_byte 0 ;Next CRC on HL
Is not by any means pretty, and using this macro will add ~2KB of code to the ROM on each invocation but since it will only be needed on a single place, it can be justified.
The usual way to bit-bang UART is by making a delay loop. Grab some register, set a value, and waste some time running NOPs, however this solution is very “wasteful” in this use case, since registers and/or memory is something we do not have, hence the delay loop shall be unrolled.
The lower the baudrate, the bigger the unrolled delay loop will be but for
compatibility with the previous bootloader, 4800bps will be the aim. For the
delay loop itsels we need a “safe” opcode, meaning no memory/register writes.
The candidates are
BIT and direct
Note: Compatibility not an issue any more, serial port now runs at 9600bps but the calculations on this section refer to the 4800bps version.
BIT b, (YX+n) (~6 T-States per byte) the UART test ROM
weights 4490 bytes. Using
CP (HL) +
NOP it weights 3710 bytes. Each
instance of the UART TX macro will use slightly above 1KB of ROM, making
it quite expensive but not prohibitely so.
Complementary to the transmission macros, the reception macro is marginally more tricky. Macro shall detect a low level on the RX line, wait for one and a half bit, then sample the 8 bits.
----------+ +-----+-----+-----+-----+-----+-----+-----+-----+----- | S | | | | | | | | | S Detect st | T | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | T bit ----> | A | | | | | | | | | O +-----+-----+-----+-----+-----+-----+-----+-----+-----+ Wait half | | | | | | | | | | bit ---------^ | | | | | | | | | | | | | | | | | | | Wait 1 bit & | | | | | | | | | | sample ------^-----^-----^-----^-----^-----^-----^-----^-----^-----^
Macro returns the read value on
A and sets
C on success. If the start bit
cannot be detected, macro returns error by clearing
C bit. Like the UART TX
macro, every instance of this will cost you ~1KB of ROM. Register usage is the
A register and the
After some tuning, the sampling on the logic analyzer looked pretty good.
At its barest minimu, XMODEM needs to keep track of the current packet, the
current byte, and a pointer to the current RAM address being written. For this
BC register is reserved for Byte and paCket state, and for the
memory pointer, the
IX register is being used. For the rest of “State” the
program flow itself and the use-for-everything registers
The basic XMODEM (CRC) flow is as follows.
Like it was said, a state machine controling this flow needs essentially no memory. The exceptional cases not listed on the diagram would be trying to load a packet beyond the 8KB mark, on such case the bootloaders should stop. And the event where the transfer never starts, on such case the bootloader may either stop or try to load the next media, essentially acting like a BIOS replacement.
With the implementation completed, the throughput is ~785 Bytes/s, including XMODEM overhead, the Master System checking the CRC, and the PC/SMS latency. In the end it gives sround an 82% channel usage from a 960 Bytes/s maximum capacity.
For the curious, here is a complete XMODEM transmission under the logic analyzer. At any given time there is only one outstanding byte on both lines, making implementation quite easy.
Video of the first working ROM, at 4800bps. Download as MP4 file.