Three Rings for the Z80

Over the past few years I’ve implemented a number of interfaces for Z80 peripherals based on the principle of the interrupt driven ring buffer. Each implementation of a ring exhibits its own peculiarities, based on the specific hardware. But essentially I have but one ring to bring them all and in the darkness bind them.

This is some background on how these interfaces work, why they’re probably fairly optimal at what they do, and things to consider if extending these to other platforms and devices.

The ring buffer is a mechanism which allows a producer and a consumer of information to each work at a pace that suits their own needs, without having to coordinate their timing.

Wikipedia defines a circular buffer, or ring buffer, as a data structure that uses a single fixed-size buffer as if it were connected end-to-end. The most useful property of the ring buffer is that it does not need to have its elements relocated as they are added or consumed. It is best suited to be a FIFO buffer.

Background

Over the past few years, I’ve used the ring buffer mechanism written by Dean Camera in many AVR projects. These include interrupt driven USART interfaces, a digital audio delay loop, and a packet assembly and play-out buffer for a digital walkie-talkie.

More recently, I’ve been working with Z80 platforms and I’ve taken that experience into building interrupt driven ring buffer mechanisms for peripherals on the Z80 bus. These include three rings for three different USART implementations, and a fourth ring for an Am9511A APU.

But firstly, how does the ring buffer work? For the details, the Wikipedia entry on circular buffers is the best bet. But quickly, the information (usually a byte, but not necessarily) is pushed into the buffer by the producer, and it is removed by the consumer.

The producer maintains a pointer to where it is inserting the data. The consumer maintains a pointer to where it is removing the data. Both producer and consumer have access to a count of how many items there are in the buffer and, critically, the act of counting entries present in the buffer and adding or removing data must be synchronised or atomic.
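As a rough illustration of the mechanism just described, here is a minimal sketch in C. The names (ring_t, ring_put, ring_get) and the 16 byte size are my own assumptions for the example and are not taken from any of the drivers discussed here. On real hardware the store or load and the count update would also need to be protected from interruption.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u   /* illustrative size only */

typedef struct {
    uint8_t data[RING_SIZE];
    uint8_t in;     /* producer's insertion index */
    uint8_t out;    /* consumer's removal index */
    uint8_t count;  /* number of bytes currently stored */
} ring_t;

/* Producer side: returns false (and stores nothing) when the ring is full. */
static bool ring_put(ring_t *r, uint8_t byte)
{
    if (r->count == RING_SIZE)
        return false;
    r->data[r->in] = byte;
    r->in = (uint8_t)((r->in + 1u) % RING_SIZE);
    r->count++;     /* must not be separated from the store above */
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_get(ring_t *r, uint8_t *byte)
{
    if (r->count == 0)
        return false;
    *byte = r->data[r->out];
    r->out = (uint8_t)((r->out + 1u) % RING_SIZE);
    r->count--;     /* must not be separated from the load above */
    return true;
}
```

The producer and consumer never touch each other's index. The only shared state is the count, which is why that is the item needing synchronisation.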

8 Bit Optimisation

The AVR example code is written in C and is not optimised for the Z80 platform. By using some platform specific design decisions it is possible to substantially optimise the operation of a general ring buffer, which is important as the Z80 is fairly slow.

The first optimisation is to assume that the buffer is exactly one page or 256 bytes. The advantage we have there is that addressing in Z80 is 16 bits and if we’re only using the lowest 8 bits of addressing to address 256 bytes, then we simply need to align the buffer onto a single 256 byte page and then increment through the lowest byte of the buffer address to manage the pointer access.
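The effect of the one page assumption can be sketched in C, where a uint8_t index plays the role of the low byte of the Z80 buffer address. The type and function names are hypothetical. The point is only that an 8 bit increment wraps from 255 to 0 without any test or jump.

```c
#include <stdint.h>

/* Hypothetical receive ring occupying exactly one 256 byte page. */
typedef struct {
    uint8_t data[256];
    uint8_t in;    /* stands in for the low byte of the insertion pointer */
    uint8_t out;   /* stands in for the low byte of the removal pointer */
} page_ring_t;

/* Store a byte and advance. The 8 bit index wraps from 255 to 0 by itself. */
static void page_ring_store(page_ring_t *r, uint8_t byte)
{
    r->data[r->in] = byte;
    r->in++;
}

/* Load a byte and advance, with the same natural wrap. */
static uint8_t page_ring_load(page_ring_t *r)
{
    uint8_t byte = r->data[r->out];
    r->out++;
    return byte;
}
```

Tracking how many bytes are stored is deliberately left out of this sketch, so it shows only the pointer arithmetic.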

If 256 bytes is too many to allocate to the buffer, then we can use a power of 2 buffer size and align the buffer in memory so that it falls on a boundary of the buffer size. The calculation for the pointers then becomes simple masking (rather than a decision and jump). Simple masking ensures that no jumps are taken, which means that the code flow or delay is constant no matter which place in the buffer is being written or read.
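A small sketch of the masking idea follows, written in C rather than Z80 assembly. The 16 byte size and the function names are assumptions for illustration. The second function shows the same trick applied to a full 16 bit address, which only works because the buffer is aligned to its own size.

```c
#include <stdint.h>

#define TX_SIZE 16u                 /* hypothetical 2^n buffer size */
#define TX_MASK (TX_SIZE - 1u)

/* Advance an index with a mask instead of a compare and jump. */
static uint8_t next_index(uint8_t index)
{
    return (uint8_t)((index + 1u) & TX_MASK);
}

/* The same step on a 16 bit address inside a buffer aligned to TX_SIZE:
 * the upper bits are the fixed base, only the masked low bits move. */
static uint16_t next_address(uint16_t addr)
{
    uint16_t base = (uint16_t)(addr & ~TX_MASK);
    return (uint16_t)(base | ((addr + 1u) & TX_MASK));
}
```

Both functions execute the same instructions on every call, whether or not the pointer wraps.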

Note that although the number of bytes allocated to the buffer is 256, the buffer cannot be filled completely. The fullness is counted in a single byte, so a count of 256 wraps to zero, and a completely full 256 byte buffer cannot be discriminated from an empty one. This does not apply where the buffer is smaller than the full page.

With these two optimisations in place, we can now look at three implementations of USART interfaces for the Z80 platform. These are the MC6850 ACIA, the Zilog SIO/2, and the Z180 ASCI interface. There is also the Am9511A interface, which is a little special as it has multiple independent ring buffers, and has multi-byte insertion.

Implementations

To start the discussion, let us look at the ACIA implementation for the RC2014 CP/M-IDE BIOS. I have chosen this file because all of the functions are contained in one file, which provides an easier overview. The functions are identical to those found in the z88dk RC2014 ACIA device directory.

Using the ALIGN keyword of z88dk, the ring buffer itself is placed on a page boundary, in the case of the receive buffer of 256 bytes, and on the buffer size boundary, in the case of the transmit buffer of 2^n bytes.

Note that where the buffer is smaller than a full page, all of the bytes in the buffer could be used, because the buffer counter won’t overflow. I haven’t made that additional optimisation in my code, so no matter how many bytes are allocated to a buffer, one byte always remains unused.

Once the buffer is located, the process of producing and consuming data is left to either put or get functions which write to, or read from, the buffer as and when they choose to. There is no compulsion for the main program flow to write or read at a particular time, and therefore the flow of code is never delayed. This is optimum from the point of view of minimising delay and maximising compute time. Additional functions such as flush, peek, and poll are also provided to simplify program flow, and init to set up the peripheral and initialise the buffers on first use.

With the buffer available then the interrupt function can do its work. Once an interrupt from the peripheral is signalled, the interrupt code checks to see whether a byte has been received. If not then the interrupt (in the case of the ACIA and ASCI) must have been triggered by the transmit hardware becoming available.

If in fact a byte has been received by the peripheral then the interrupt code recovers the byte, and checks there is room in the buffer to store it. If not, then the byte is simply abandoned. If there is space, then the byte is stored, and the buffer count is incremented. It is critical that these two items happen atomically, which in the case of an interrupt is the natural situation.
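The receive path described above might look something like this in C. This is a sketch only: the real drivers are written in Z80 assembly, and the names here (rx_buffer, rx_isr_store) are mine. It assumes the byte has already been read from the peripheral, and it keeps one slot unused so that the 8 bit count never wraps.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_SIZE 256u

static uint8_t rx_buffer[RX_SIZE];
static uint8_t rx_in;       /* insertion index, wraps at 256 on its own */
static uint8_t rx_count;    /* bytes waiting for the consumer */

/* Body of a receive interrupt. 'byte' has already been read from the
 * peripheral's data register. Returns false if the byte was dropped. */
static bool rx_isr_store(uint8_t byte)
{
    if (rx_count >= RX_SIZE - 1u)
        return false;               /* no room: abandon the byte */
    rx_buffer[rx_in++] = byte;      /* store the byte ...              */
    rx_count++;                     /* ... and count it, uninterrupted */
    return true;
}
```

Because this runs inside the interrupt, nothing else can observe the buffer between the store and the count update.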

If the transmission hardware has signalled that it is free, then the buffer is checked for an available byte to transmit. If none is found then the transmit interrupt is disabled. Otherwise the byte is retrieved from the buffer and written to the transmit hardware while the buffer count is decremented.

If the transmit buffer count reaches zero when the current byte is transmitted, then the interrupt must disable further transmit interrupts to prevent the interrupt being called unnecessarily (i.e. with the buffer empty).
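The transmit side of the interrupt can be sketched the same way. Here the peripheral is replaced by two plain variables (hw_tx_data and hw_tx_irq_enabled), which are stand-ins I have invented so the logic can be followed and run on any machine. They do not correspond to real register names.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_SIZE 16u                 /* hypothetical 2^n buffer size */
#define TX_MASK (TX_SIZE - 1u)

static uint8_t tx_buffer[TX_SIZE];
static uint8_t tx_out;              /* removal index */
static uint8_t tx_count;            /* bytes waiting to be sent */

/* Stand-ins for the peripheral: a data register and an interrupt enable bit. */
static uint8_t hw_tx_data;
static bool hw_tx_irq_enabled;

/* Body of a "transmitter ready" interrupt. */
static void tx_isr(void)
{
    if (tx_count == 0) {
        hw_tx_irq_enabled = false;  /* nothing to send, stop interrupting */
        return;
    }
    hw_tx_data = tx_buffer[tx_out];
    tx_out = (uint8_t)((tx_out + 1u) & TX_MASK);
    tx_count--;
    if (tx_count == 0)
        hw_tx_irq_enabled = false;  /* that was the last byte */
}
```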

Multi-byte Receive

Both the SIO and ASCI have multi-byte hardware FIFO buffers available. This is to prevent over-run of the hardware should the CPU be unable to service the receive interrupt in sufficient time. This could happen if the CPU is left with its general interrupt disabled for some time.

In this situation, the SIO receive interrupt and the ASCI interrupt have the capability to check for additional bytes before continuing.
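A sketch of that "check for additional bytes" loop is shown below. The hardware FIFO is simulated with a small array, and its depth of four is an arbitrary choice for the example rather than a statement about either device. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated hardware receive FIFO (the depth here is illustrative only). */
static uint8_t hw_fifo[4];
static uint8_t hw_fifo_len;

static bool hw_rx_available(void)
{
    return hw_fifo_len > 0;
}

/* Pop the oldest byte from the simulated FIFO. */
static uint8_t hw_rx_read(void)
{
    uint8_t byte = hw_fifo[0];
    for (uint8_t i = 1; i < hw_fifo_len; i++)
        hw_fifo[i - 1] = hw_fifo[i];
    hw_fifo_len--;
    return byte;
}

static uint8_t rx_buffer[256];
static uint8_t rx_in;
static uint8_t rx_count;

/* Receive interrupt body: keep reading until the hardware has nothing left,
 * so one interrupt can service every byte that arrived while we were away. */
static void rx_isr(void)
{
    while (hw_rx_available()) {
        uint8_t byte = hw_rx_read();    /* always read, to empty the FIFO */
        if (rx_count < 255u) {          /* store only if there is room */
            rx_buffer[rx_in++] = byte;
            rx_count++;
        }
    }
}
```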

Transmit cut-through

One additional feature worth discussing is the presence of a transmit cut-through, which minimises delay when writing the “first byte”. Because the Z80 processor is relatively slow compared to a serial interface, it is common for the transmit interface to be idle when the first byte of a sequence of bytes is written. In this situation writing the byte into the transmit buffer, and then signalling a pseudo interrupt (by calling the interrupt routine) would be very costly. In the case of the first byte it is much more effective simply to cut-through and write directly to the hardware.
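The cut-through decision can be sketched as follows. Again the peripheral is represented by ordinary variables that I have made up for the example (hw_tx_ready, hw_tx_data, hw_tx_irq_enabled), and in a real driver the buffered path would run with interrupts disabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_SIZE 16u                 /* hypothetical 2^n buffer size */
#define TX_MASK (TX_SIZE - 1u)

static uint8_t tx_buffer[TX_SIZE];
static uint8_t tx_in;               /* insertion index */
static uint8_t tx_count;            /* bytes waiting to be sent */

/* Stand-ins for the peripheral state. */
static bool hw_tx_ready = true;     /* transmitter idle */
static bool hw_tx_irq_enabled;
static uint8_t hw_tx_data;

/* Returns false if the byte could not be accepted (buffer full). */
static bool tx_put(uint8_t byte)
{
    if (tx_count == 0 && hw_tx_ready) {
        hw_tx_data = byte;          /* cut-through: straight to the hardware */
        hw_tx_ready = false;        /* hardware is now busy shifting it out */
        return true;
    }
    if (tx_count >= TX_SIZE - 1u)
        return false;               /* one slot always stays unused */
    tx_buffer[tx_in] = byte;
    tx_in = (uint8_t)((tx_in + 1u) & TX_MASK);
    tx_count++;
    hw_tx_irq_enabled = true;       /* let the interrupt drain the buffer */
    return true;
}
```

The first byte never touches the buffer at all, and only the following bytes pay the cost of buffering and an interrupt.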

Atomicity

For the ring buffer to function effectively, the atomicity of specific operations must be guaranteed. During an interrupt in Z80 further interrupts are typically not permitted, so within the interrupt we have a degree of atomicity. The only exception to this rule is the Z80 Non Maskable Interrupt (NMI), but since this interrupt is not compatible with CP/M it has never been used widely and is therefore not a real issue.

For the buffer get function the only concern is that the retrieval of a byte is atomically linked to the number of bytes in the buffer.

For the put function it is similar, however as the transmit interrupt needs to be enabled by the put function, atomicity is required to ensure that this process is not interrupted.
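To make the get case concrete, here is a sketch of a consumer function with its critical section marked. The functions disable_interrupts() and enable_interrupts() are placeholders for the Z80 DI and EI instructions, modelled here as a simple flag so the example can run anywhere. Exactly where the real drivers place DI and EI may differ from this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholders for the Z80 DI and EI instructions. */
static bool irq_enabled = true;
static void disable_interrupts(void) { irq_enabled = false; }
static void enable_interrupts(void)  { irq_enabled = true; }

static uint8_t rx_buffer[256];
static uint8_t rx_out;                  /* removal index, wraps at 256 */
static volatile uint8_t rx_count;       /* shared with the interrupt */

/* Returns the next byte (0..255), or -1 when the buffer is empty. */
static int rx_get(void)
{
    if (rx_count == 0)
        return -1;              /* the interrupt can only add bytes, so
                                   reading the count here is safe */
    disable_interrupts();       /* begin critical section */
    uint8_t byte = rx_buffer[rx_out++];
    rx_count--;                 /* byte removal and count stay in step */
    enable_interrupts();        /* end critical section */
    return byte;
}
```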

Interrupt Mode

Across the three implementations there are three different Z80 interrupt modes in play. The Motorola ACIA is not a Zilog Z80 peripheral, so it can only signal a normal interrupt, and can therefore (without some dirty tricks) only work in Interrupt Mode 1. For the RC2014 implementation it is attached to INT or RST38 and therefore when an interrupt is triggered it is up to the interrupt routine to determine why an interrupt has been raised. This leads to a fairly long and slow interrupt code.

The Z180 ASCI has two ports and is attached to the Z180 internal interrupt structure, which works much like the Z80 Interrupt Mode 2, although it is actually independent of the Z80 interrupt mode. Each Z180 internal interrupt is separately triggered, however it still cannot discern between a receive and a transmit event. So the interrupt handling is essentially similar to that of the ACIA.

The Zilog SIO/2 is capable of being attached to the Z80 in Interrupt Mode 2. This means that the SIO can be configured to place a specific vector on the Z80 data bus during the interrupt acknowledge cycle, with a different vector for each interrupt cause. The interrupts for transmit empty, received byte, transmit error, and receive error are all signalled separately via an IM2 Interrupt Vector Table. This leads to concise and fast interrupts, specific to the cause at hand. The SIO/2 is the most efficient of all the interfaces described here.
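For readers unfamiliar with Interrupt Mode 2, the address arithmetic is simple enough to sketch. The Z80 combines its I register (high byte) with the vector byte supplied by the interrupting device (low byte) to form a pointer into a table, and the 16 bit handler address is read from that table entry. The memory array, the table location at 0x8000, and the function names below are all invented for the example.

```c
#include <stdint.h>

static uint8_t memory[65536];   /* simulated Z80 address space */

/* Address of the table entry: I register is the high byte, and the
 * vector byte supplied by the peripheral is the low byte. */
static uint16_t im2_table_entry(uint8_t i_reg, uint8_t vector)
{
    return (uint16_t)((i_reg << 8) | vector);
}

/* The handler address is the little-endian 16 bit word at that entry. */
static uint16_t im2_handler(uint8_t i_reg, uint8_t vector)
{
    uint16_t entry = im2_table_entry(i_reg, vector);
    uint8_t low  = memory[entry];
    uint8_t high = memory[(uint16_t)(entry + 1u)];
    return (uint16_t)(low | (high << 8));
}
```

Each interrupt cause gets its own table entry, so the CPU arrives directly in the right handler with no polling of status registers.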

Multi-byte buffers

For interest, the Am9511A interface uses two buffers, one for the one byte commands, and one for the two byte operand pointers. The command buffer is loaded with actions that the APU needs to perform, including some special (non hardware) commands to support loading and unloading operands from the APU FILO.

A second Am9511A interface also uses two buffers, one for one byte commands, and one for either two or four byte operands. This mechanism is not as nice as storing pointers as in the above driver, but is required for situations where the Z180 is operating with paged memory.

I’ve since revised the above solution again to use three byte operand (far) pointers, as that makes for a much simplified user experience. The operands don’t have to be unloaded by the user. They simply appear auto-magically…
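Multi-byte insertion adds one wrinkle to the ring: the space check has to cover the whole item before any byte is written, otherwise half an operand pointer could be left in the buffer. The sketch below shows this for a two byte pointer. The buffer size, the low-byte-first ordering, and the names are assumptions of mine and are not taken from the Am9511A driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTR_SIZE 32u                /* hypothetical 2^n buffer size */
#define PTR_MASK (PTR_SIZE - 1u)

static uint8_t ptr_buffer[PTR_SIZE];
static uint8_t ptr_in;              /* insertion index */
static uint8_t ptr_count;           /* bytes currently stored */

/* Queue a 16 bit operand pointer as two bytes, low byte first.
 * Either both bytes go in, or neither does. */
static bool ptr_put(uint16_t operand_addr)
{
    if (ptr_count + 2u > PTR_SIZE - 1u)
        return false;               /* not enough room for the whole item */
    ptr_buffer[ptr_in] = (uint8_t)(operand_addr & 0xFFu);
    ptr_in = (uint8_t)((ptr_in + 1u) & PTR_MASK);
    ptr_buffer[ptr_in] = (uint8_t)(operand_addr >> 8);
    ptr_in = (uint8_t)((ptr_in + 1u) & PTR_MASK);
    ptr_count = (uint8_t)(ptr_count + 2u);
    return true;
}
```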

ATmega Arduino USART in SPI Master Mode MSPIM

The AVR ATmega MCUs used by the Arduino Uno and its clones and peers (Leonardo, Pro, Fio, LilyPad, etc) and the Arduino Mega have the capability to use their USART (Universal Synchronous and Asynchronous serial Receiver and Transmitter), also known as the Serial Port, as an additional SPI bus interface in SPI Master mode. This fact is noted in the datasheets of the ATmega328p, ATmega32u4, and the ATmega2560 devices at the core of the Arduino platforms, but until recently it hasn’t meant much to me.

Over the past 18 months I’ve been working on an advanced derivative of the Arduino platform, using an ATmega1284p MCU at its core. I consider the ATmega1284p device the “Goldilocks” of the ATmega family, and as such the devices I’ve built have carried that name. Recently I have been working on a platform which has some advanced analogue output capabilities incorporating the MCP4822 dual channel DAC, together with a quality headphone amplifier, and linear OpAmp for producing buffered AC and DC analogue signals. This is all great, but when it comes down to outputting continuous analogue samples to produce audio it is imperative that the sample train is not interrupted or the music simply stops!

The issue is that the standard configuration of the Arduino platform (over)loads the SPI interface with all of the SPI duties. In the case of the Goldilocks and other Arduino style devices I have ended up having the MicroSD card, some SPI EEPROM and SRAM, and the MCP4822 DAC all sharing the same SPI bus. This means that the input stream of samples from the MicroSD card is interfering and time-sharing with the output sample stream to the DAC. The MicroSD card has a lot of latency, often taking hundreds of milliseconds to respond to a command, whereas the DAC needs a constant stream of samples with no jitter and no more than 22us between each sample. That is a conflict that is difficult to resolve. Even using large buffers is not a solution, as when streaming audio it is easy to consume MBytes of information, which is orders of magnitude more than can be buffered anywhere on the ATmega platform.
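The numbers behind this conflict are easy to check. The sketch below assumes 44.1kHz audio with two channels of two bytes each, which is my assumption for the arithmetic rather than a statement about the source files, and compares it against the 16 KB of SRAM on the ATmega1284p. The function names are hypothetical.

```c
#include <stdint.h>

/* Time between samples, in nanoseconds (truncated). */
static uint32_t sample_period_ns(uint32_t sample_rate_hz)
{
    return 1000000000u / sample_rate_hz;
}

/* Bytes consumed per second by an uncompressed sample stream. */
static uint32_t stream_bytes_per_second(uint32_t sample_rate_hz,
                                        uint32_t channels,
                                        uint32_t bytes_per_sample)
{
    return sample_rate_hz * channels * bytes_per_sample;
}

/* How many milliseconds of audio fit in a buffer of the given size. */
static uint32_t buffer_ms(uint32_t buffer_bytes, uint32_t bytes_per_second)
{
    return (uint32_t)(((uint64_t)buffer_bytes * 1000u) / bytes_per_second);
}
```

With these assumptions sample_period_ns(44100) gives 22675 ns, the stream consumes 176400 bytes per second, and even the whole 16384 bytes of SRAM would hold only about 92 ms of audio, which is less than the card latency mentioned above.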

Other solutions using a DAC to generate music have used a “soft SPI” and bit-banging techniques to work around the issue. But this creates a performance limitation as the maximum sample output rate is strongly limited by the rate at which the soft SPI port can be bit-banged. There has to be a better way.

USART in SPI mode

The better way to attach SPI Slave devices to the ATmega platform is referenced in this overlooked datasheet heading: “USART in SPI mode”. Using the USART in “Master SPI Mode” (MSPIM) may be limiting if you need to use the sole serial port to interact with the Arduino (ATmega328p), but once the program is loaded (in the case of using a bootloader) there is often no further need to use the serial port. For debugging, however, if there is only one USART then obviously it becomes uncomfortable to build a system based on the sole USART in SPI mode.

However in the case of the Goldilocks ATmega1284p MCU with two USARTs, the Arduino Leonardo with both USB serial and USART, and the Arduino Mega ATmega2560 MCU with four USARTs, there should be nothing to stop us converting their use to MSPIM buses according to need.

Excuse me for being effusive about this MSPIM capability in the AVR ATmega. It is not exactly a secret as it is well documented and ages old, but it is a great feature that I’ve simply not previously explored. But now I have explored it, I think it is worthwhile to write about my experience. Also, I think that many others have also overlooked this USART MSPIM capability, because of the dearth of objective review to be found on the ‘net.

Any ATmega datasheet goes into the detailed features and operation of the USART in SPI mode. I’ll go into some of the features in detail and what it means for use in real life.

  • Full Duplex, Three-wire Synchronous Data Transfer – The MSPIM does not rely on having a Slave Select line on a particular pin, and further it doesn’t rely on having both MOSI and MISO lines active at the same time. This means that it is possible to attach a SPI Slave device that doesn’t use the _SS to begin or end transactions with just two pins, being the XCK pin and the Tx pin. If a _SS is required (as in the MCP4822) then only three wires are required. The fact that the MISO (Rx) pin is optional saves precious pins too.
  • Master Operation – The MSPIM only works in SPI Master mode, which means that it is only really useful for connecting accessories. But in the Arduino world, that is what we are doing 99% of the time.
  • Supports all four SPI Modes of Operation (Mode 0, 1, 2, and 3) – yes, it does.
  • LSB First or MSB First Data Transfer (Configurable Data Order) – yes, it does.
  • Queued Operation (Double Buffered) – The MSPIM inherits the USART Tx double buffering capability. This is a function not available on the standard SPI interface and is a great thing. For example, to output a 16bit command two writes to the I/O register can follow each other immediately, and the resulting XCK has no delay between each byte output. To output a stream of bytes the buffer empty flag can be used as a signal to load the next available byte, ensuring that if the next byte can be loaded within 16 clock cycles (the time taken to shift out one byte at FCPU/2) then we can generate a constant stream of bytes. In contrast, transmission on the standard SPI interface is not buffered, and therefore in Master Mode we’re invariably wait-looping before sending the next byte. This wastes cycles in between each byte in recognising completion, and then loading the next byte for transmission.
  • High Resolution Baud Rate Generator – yes, it is. The MSPIM baud rate can be set to any rate up to half the FCPU clock rate. Whilst there may be little need to run the MSPIM interface at less than the maximum for pure SPI transactions, it is possible to use this feature, together with double buffered transmission, to generate continuous arbitrary binary bit-streams at almost any rate.
  • High Speed Operation (fXCKmax = FCPU/2) – The MSPIM runs at exactly the same maximum clock speed as the standard SPI interface, but through the double buffering capability mentioned above the actual byte transmission rate can be significantly greater.
  • Flexible Interrupt Generation – The MSPIM has all the same interrupts as the USART from which it inherits its capabilities. In particular the differentiation between buffer space available flag / interrupt and transmission complete flag / interrupt capabilities make it possible to develop useful arbitrary byte streaming solutions.
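The baud rate and throughput figures in the list above follow from the datasheet relationship for the USART in synchronous master mode, fXCK = FCPU / (2 × (UBRRn + 1)). The helper functions below are a sketch of that arithmetic only. Their names are mine, they run on any machine rather than on the AVR, and they do not check that the result fits the 12 bit UBRRn register.

```c
#include <assert.h>
#include <stdint.h>

/* XCK frequency produced by a given UBRRn value. */
static uint32_t mspim_clock_hz(uint32_t f_cpu, uint16_t ubrr)
{
    return f_cpu / (2u * ((uint32_t)ubrr + 1u));
}

/* Smallest UBRRn value whose clock does not exceed target_hz. */
static uint16_t mspim_ubrr_for(uint32_t f_cpu, uint32_t target_hz)
{
    assert(target_hz > 0);
    uint32_t divisor = 2u * target_hz;
    uint32_t steps = (f_cpu + divisor - 1u) / divisor;  /* round up */
    return (uint16_t)(steps > 0 ? steps - 1u : 0u);
}

/* CPU cycles needed to shift out one 8 bit byte. */
static uint32_t cpu_cycles_per_byte(uint16_t ubrr)
{
    return 8u * 2u * ((uint32_t)ubrr + 1u);
}
```

For example, with a 16 MHz clock (chosen only for illustration) UBRRn = 0 gives an 8 MHz XCK and 16 CPU cycles per byte, and a 1 MHz target needs UBRRn = 7.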

Implementation Notes

As the USART in normal mode and the USART in MSPIM mode are quite similar in operation there is little that needs to be written. The data sheet has a very simple initialisation code example, which in practice is sufficient for getting communications going. A few points are worth noting. As there is no automatic Slave Select management, the _SS line needs to be manually configured as an output, and then set (high) until such time as the attached SPI device is to be addressed. The XCKn pin (USART synchronous clock output) needs to be set as an output before configuring the USART for MSPIM. The transmission complete flag (TXCn) is not automatically cleared by reading (it is only automatically cleared if an interrupt is processed), so if you are planning to use it to signal transaction completion in your code it needs to be manually cleared (by writing a 1 to the TXCn bit) before commencing a transmission. The Transmit and Receive Data Register (UDRn) is also not automatically cleared, and needs to be flushed before use if a receive transaction is to be synchronised to the transmitted bytes.

So the implementation of a simple initialisation code fragment looks like this:

SPI_PORT_DIR_SS |= _BV(SPI_BIT_SS); // Set SS as output pin.
SPI_PORT_SS |= _BV(SPI_BIT_SS);     // Pull SS high to deselect the SPI device.
UBRR1 = 0x0000;                     // Baud rate register must be zero while the transmitter is being enabled.
DDRD |= _BV(PD4);                   // Setting the XCK1 port pin as output, enables USART SPI master mode (this pin for ATmega1284p)
UCSR1C = _BV(UMSEL11) | _BV(UMSEL10) | _BV(UCSZ10) | _BV(UCPOL1);
                                    // Set USART SPI mode of operation and SPI data mode 1,1. UCPHA1 = UCSZ10
UCSR1B = _BV(TXEN1);                // Enable the Tx only (the Rx and all of the interrupt enable bits are left at 0).
                                    // Set baud rate. IMPORTANT: The Baud Rate must be set after the Transmitter is enabled.
UBRR1 = 0x0000;                     // Where maximum speed of FCPU/2 = 0x0000

And a fragment of the code to transmit a 16 bit value looks like this. Note with this example there is no need to wait for the UDREn flag to be set between bytes, as we are only writing two bytes into the double transmit buffer. This means that the 16 clocks are generated on XCKn with no gap in delivery.

UCSR1A = _BV(TXC1);              // Clear the Transmit complete flag, all other bits should be written 0.
SPI_PORT_SS &= ~_BV(SPI_BIT_SS); // Pull SS low to select the SPI device.
UDR1 = write.value.u8[1];        // Begin transmission of first byte.
UDR1 = write.value.u8[0];        // Continue transmission with second byte.
while ( !(UCSR1A & _BV(TXC1)) ); // Check we've finished, by waiting for Transmit complete flag.
SPI_PORT_SS |= _BV(SPI_BIT_SS);  // Pull SS high to deselect the SPI device.
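
For context, the 16 bit value being written is an MCP4822 command word: bit 15 selects the channel, bit 13 the gain, bit 12 enables the output, and the low 12 bits carry the sample. The sketch below builds such a word and splits it into the two bytes in the order they are sent (high byte first). The macro and function names are my own, and this is host-side arithmetic rather than code from the Goldilocks library.

```c
#include <stdint.h>

#define MCP4822_CHANNEL_B  0x8000u  /* bit 15: 0 selects DAC A, 1 selects DAC B */
#define MCP4822_GAIN_1X    0x2000u  /* bit 13: 1 = 1x gain, 0 = 2x gain */
#define MCP4822_ACTIVE     0x1000u  /* bit 12: 1 = output active, 0 = shut down */

/* Combine control flags with a 12 bit sample (upper sample bits are dropped). */
static uint16_t mcp4822_command(uint16_t flags, uint16_t sample)
{
    return (uint16_t)(flags | (sample & 0x0FFFu));
}

/* Split the word into the two bytes in transmission order, high byte first. */
static void split_msb_first(uint16_t word, uint8_t out[2])
{
    out[0] = (uint8_t)(word >> 8);
    out[1] = (uint8_t)(word & 0xFFu);
}
```

On the little-endian AVR the high byte of a 16 bit value is u8[1], which is why the fragment above writes u8[1] before u8[0].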

Results

Looking at the output generated by the two different SPI interfaces on the AVR ATmega, it is easy to see the features in action. In the first image we can see that the two bytes of the 16 bit information for the DAC are separated, as loading the next byte to be transmitted requires clock cycles AFTER the transmission completed SPIF flag has been raised.

DAC control using SPI bus.

In the case of the MSPIM output, we can’t recognise where the two bytes are separated, and the end of the transaction is triggered by the Transmit complete flag. This example shows that the MSPIM can actually be faster than the standard SPI interface, even though the maximum clock speed in both cases is FCPU/2.

DAC control using USART MSPIM bus.

The final image shows the Goldilocks DAC generating a 44.1kHz output signal, with dual 12 bit outputs. Whilst this is not fully CD quality, comparisons with other DAC solutions available on the Arduino platform have been favourable.

44.1kHz samples using USART MSPIM output.

Conclusion

I am now convinced to use the USART MSPIM capability for the Goldilocks Analogue, and I think that it is time to write some generalised MSPIM interface routines to go into my AVRfreeRTOS Sourceforge repository to make it easy to use this extremely powerful capability.