ArduSat XRAMFS Prototyping

It is not every day that I get to tell the family I’m doing “rocket science”, but I hope the past few days can count as an exception. Space, the final frontier. In this case, it was a lack of space, and the frontier it creates, that got me thinking.

At the recent Linux Conf AU Jon Oxer spoke about Freetronics’ efforts in designing the payload for the upcoming NanoSatisfi ArduSat1 launch (pictured below). Jon mentioned in the presentation that the AVR freeRTOS code compilation that I’ve been supporting is being used in the Supervisor node of that platform.

Ardusat_payload_freetronics

I immediately thought that it would be great to build a distributed cache RAM system to support each of the ATmega328p “Arduino” Client nodes, using the XRAM capabilities of the ATmega2561 Supervisor node. So, I did.

P1030071
P1030068

Using this prototype system, each Arduino Client node now has sole access to 32kByte of XRAMFS in addition to its 2kByte of internal RAM.

Initial performance measured is 422kByte/s throughput for the swap function. In other words, half of the entire Arduino RAM can be swapped with the contents of XRAMFS in just 4.74ms.

I’ve also uploaded the code for supporting a centralised SD Card on this platform to Sourceforge AVRfreeRTOS, and written about it at ArduSat SD Card Prototyping.

Background

In working with the Arduino hardware I’ve found that the severe limitation in RAM space causes constraints on what can be done. For example, Ethernet, USB and other modern applications need kBytes of buffer to be used effectively, and the ATmega328p used as the Arduino Uno platform supports a total of only 2kB RAM.

Using the Arduino Mega (or Android ADK hardware) has been the saviour of that situation for me, offering an identical environment, but 8kByte of RAM as a playground. And, most importantly, the ability to directly connect 0 wait-state external SRAM.

This XRAM capability of the ATmega2560 and ATmega2561 has been exploited by Rugged Circuits in their QuadRam module, which offers 512kByte of SRAM in one small package.

P1030069

Therefore, using common off the shelf technology, I had the materials available to test the theory that building an XRAMFS system, to support the ArduSat platform, would work.

This allows each ArduSat Client to store 16 TIMES more data than it can currently access, and to reach that data at a relatively high speed on a medium that, unlike an SD card for example, is not subject to wear.

Ingredients & Build

This section looks at the ingredients and how to construct the prototype.

Supervisor Node – Arduino Mega / Freetronics EtherMega / Android ADK

The ArduSat Supervisor node is based on the ATmega2561 MCU, because it is significantly smaller than the ATmega2560 MCU used in the Arduino Mega platform. The only difference between the two chips is that the ATmega2561 doesn’t provide as many Ports, and has only 64 Pins versus 100 Pins on the ATmega2560.

P1030070

For this prototyping, the ATmega2560 is necessary, because I elected to use pin change interrupts as part of the bus protocol. Also, the Arduino Mega platform is readily available. I don’t even know where I’d go to get an ATmega2561 board…

The use of rainbow hook-up wire was essential for the success of the prototype.

Client Node – Arduino Uno / Freetronics Eleven

The ArduSat Client node is designed to be identical to the Arduino Uno platform, to ensure that it is absolutely easy for people to test code they intend to run in space. Therefore a variety of Arduino Uno devices are being used (basically, whatever I had around).

XRAM Module – Rugged Circuits QuadRAM

I’ve previously built on both the Rugged Circuits QuadRAM and the MegaRAM. These modules slip over the end of the Arduino Mega platform, instantly enabling either 512kByte or 128kByte of zero wait state SRAM, mapped to the system address space. They also conveniently bring the SPI interface out onto through-holes for pins.

Ad200

Something about the ability to create 16x 32kByte XRAM pages, linked with 16x Client nodes, seemed like synchronicity.

Layout

The prototype platform is designed to be the classic multi-slave SPI bus layout. This design is demonstrated in the AVR151 document and, in excerpt, is reproduced below.

Spi_wiring

Because of my decision to use the Pin Change Interrupts as part of the bus protocol, the Supervisor node (SPI Master) would use the Port K and Port J pins to fill the role of individual Slave Select (SS) pins. The Client nodes would each use their normal SS pin (PB2) to connect to the Supervisor.

In designing for 16x Client nodes, there is a limitation on Port J in that the good folks at Arduino determined not to break out all of the pins which, together with sharing PCINT8 with the Rx0 pin, significantly limits the number of Clients feasible on the prototype platform.

In practice, 8 Client nodes attached to all the pins on Port K is the simple alternative. As luck (or good planning) would have it, those pins are all brought out onto one connector on the Arduino Mega platform, as evidenced by these pictures.

Amongst friends, a direct connection of the SPI SCK, MISO, and MOSI lines to all Clients is optimal. But in a shared environment, it would make sense to use FET bus isolation to keep Clients from physically attaching to the SPI bus until their SS line is held low by the Supervisor. A gram of hardware prevention can cure a tonne of software ill, as a “rogue” Client could otherwise potentially lock up the SPI bus for all, and the guys in the ISS won’t be happy if asked to hit the reset button.

Bus Protocol

Hey! – Yeah What? – This! – OK

That’s the protocol. Works in the home. Works in the office. Works the world over.

Read_overviewRead_middleviewRead_detail

The details of this Saleae Logic capture are explained below, in the Client Implementation section.

Hey!

The Supervisor node holds all the PCINT pins high. If a Client wants to initiate a Read/Write/Swap transaction, it will pull its SS line low for 30µs. This needs to be long enough for the Supervisor to register an interrupt and process it. If multiple Clients call out simultaneously, no problem, the Supervisor will grab all of the requests and push them onto a queue of requests to serve.

Yeah What?

At the next opportunity, the Supervisor serving task will pop a request off the queue, and identify which Client made the request. It will also check if there were other simultaneous requests, and push them back to the front of the queue.

The Supervisor then pulls the relevant Client SS line low. The Client has been listening for this, and at this point it enables its Slave interface to the SPI bus, and the two swap acknowledgements. When the Supervisor receives the ACK code, it knows the Client is ready, so it requests a command.

This!

When the Client (SPI Slave) has received the Supervisor ACK code, it prepares a command, and is prepared to either read, write or swap XRAMFS data under the command of the Supervisor (SPI Master).

The command set implemented by this protocol can be easily extended to include accessing other shared resources connected to the Supervisor node. This could include analogue sensors, SDCARD mass storage (though using the SPI bus would offer a degree of complexity), or serial interfaced devices.

OK

At the end of one command, with the data transaction complete, a final byte is exchanged to ensure that the Client has remained in sync with the Supervisor, and the SPI bus is released by the Client. It is important the Client stays off the SPI bus. The Supervisor then processes the next Yeah What? request.

Supervisor Implementation – freeRTOS

The Supervisor is implemented as a freeRTOS task, using standard SPI bus libraries contained in my code base. These libraries (now that this project has worked them over) are about as optimised as is possible to write in C, and achieve a good throughput over the SPI bus.

One or two PCINT-based interrupts read the PCINT pins and push the raw pin state onto a queue. This process traps multiple simultaneous requests, overcoming any interrupt masking or race conditions. Currently 30µs is allowed for the interrupts to execute. 10µs has been tested, but this time can be adjusted depending on how long the Supervisor stays in “Critical” state (interrupts off) while processing other (non-XRAMFS) tasks.

From idle, the Supervisor takes only 90µs to 100µs to pop a request from the queue and action it. Under load, it could take as long as 64ms to action a request. As soon as the pin state is collected, it is processed to identify which SS line triggered the call, and therefore which bank of XRAM should be enabled. At the same time, I check whether additional requests are pending in the same pin state. If so, the remaining pin state is pushed back on the queue to be served next time round.
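
The pin state handling described above can be sketched in plain C. This is a host-side illustration only, not the Supervisor code from AVRfreeRTOS: the function name, the result structure, and the assumption that a requesting Client appears as a 0 bit (SS pulled low) in the captured Port K state are all mine.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical result of decoding one captured pin state. */
typedef struct {
    bool    valid;      /* false if no Client was requesting          */
    uint8_t client;     /* index 0..7 of the Client to serve now      */
    uint8_t remaining;  /* pin state still to be served (active low)  */
} xramfs_request_t;

/* Assumes bit n of pin_state reads 0 while Client n holds its SS line low. */
xramfs_request_t xramfs_decode_pin_state(uint8_t pin_state)
{
    xramfs_request_t req = { false, 0, 0xFF };
    uint8_t pending = (uint8_t)~pin_state;    /* 1 bits mark requesting Clients */

    if (pending == 0)
        return req;                           /* idle bus, nothing to serve */

    while ((pending & (uint8_t)(1u << req.client)) == 0)
        req.client++;                         /* lowest-numbered requester first */

    req.valid = true;
    pending &= (uint8_t)~(1u << req.client);  /* drop the request being served */
    req.remaining = (uint8_t)~pending;        /* convert back to active-low form */
    return req;
}
```

If `remaining` is anything other than 0xFF, more Clients called out in the same capture, and that value is what would be pushed back onto the queue. The `client` index is what would select the XRAM bank.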

The exchange of acknowledgements ensures that both sides are speaking SPI, and are set to proceed.

The command contains the action (read / write / swap / test), the address of the XRAMFS block, the size of the XRAMFS block, and a CRC byte.
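
As a rough sketch of how such a command could travel over the bus, the fields can be packed into six bytes, which matches the six command bytes seen in the logic capture discussed in the Client Implementation section. The fixed-width mirror structure, the function names, and the little-endian byte order are assumptions made for illustration; the real code passes its own structure.

```c
#include <stdint.h>

enum { XRAMFS_CMD_WIRE_SIZE = 6 };  /* 1 + 2 + 2 + 1 bytes */

/* Host-side mirror of the command fields, using fixed-width types. */
typedef struct {
    uint8_t  cmd;   /* 0 Huh, 1 Read, 2 Write, 3 Swap, 4 Test */
    uint16_t addr;  /* address of the block within the XRAMFS */
    uint16_t size;  /* size of the block in bytes             */
    uint8_t  crc8;  /* CRC of the stored data                 */
} xramfs_cmd_t;

/* Serialise into 6 bytes. Little-endian field order is an assumption. */
void xramfs_cmd_pack(const xramfs_cmd_t *c, uint8_t out[XRAMFS_CMD_WIRE_SIZE])
{
    out[0] = c->cmd;
    out[1] = (uint8_t)(c->addr & 0xFF);
    out[2] = (uint8_t)(c->addr >> 8);
    out[3] = (uint8_t)(c->size & 0xFF);
    out[4] = (uint8_t)(c->size >> 8);
    out[5] = c->crc8;
}

/* Inverse of xramfs_cmd_pack, as the receiving side would apply it. */
void xramfs_cmd_unpack(const uint8_t in[XRAMFS_CMD_WIRE_SIZE], xramfs_cmd_t *c)
{
    c->cmd  = in[0];
    c->addr = (uint16_t)(in[1] | ((uint16_t)in[2] << 8));
    c->size = (uint16_t)(in[3] | ((uint16_t)in[4] << 8));
    c->crc8 = in[5];
}
```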

The bus transaction speed is dependent on the SPI Master SCK clock divisor. Optimally, an SPI Slave can receive data at 1/4 of its system clock. Currently, it is set to 1/8, therefore theoretical performance is double that of the logic capture above.

Initially, I determined to calculate a CRC byte to store along with the data, but the calculation time is large compared to the transaction time, and therefore too costly to implement at the protocol level. The application should utilise the CRC when it recovers data to confirm that the data is intact, and not irradiated.
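
For the application-level check, something like the following bitwise CRC-8 would do. Note that the polynomial (0x07, with a zero initial value) is my assumption, since nothing here fixes a particular CRC-8 variant, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0.
 * The choice of polynomial is an assumption made for this sketch. */
uint8_t crc8_block(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (uint8_t bit = 0; bit < 8; bit++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Returns 1 if recovered data still matches the CRC stored alongside it. */
int crc8_block_ok(const uint8_t *data, size_t len, uint8_t stored_crc)
{
    return crc8_block(data, len) == stored_crc;
}
```

The Client would calculate the CRC before writing a block, keep it in the command structure, and run the check again after reading the block back.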

Also, error checking following the transfer could be implemented. But at this stage I think it is better to have the Client do all sanity and error checking of its own data.

Client Implementation – freeRTOS or Arduino IDE

The Client is implemented in freeRTOS as a simple library function, that is passed a command structure, and a pointer to local RAM to be Read/Write/Swap. Some details below.

typedef enum { Huh        = 0, // Client didn't issue us a command, so just break.
               Read       = 1, // read from XRAMFS
               Write      = 2, // write to XRAMFS
               Swap       = 3, // read from both XRAMFS & local RAM, and swap
               Test       = 4  // do something else, to be determined
} RAMFSCommand; // from point of view of the client (Arduino 328p)

typedef struct        /* structure to hold the RAMFS info */
{ RAMFSCommand       ram_cmd;        // Read / Write / Swap / Test
  size_t             ram_addr;       // Address of first byte of RAM in a RAMFS (greater than RAM_START_ADDR)
  uint16_t           ram_size;       // Size of RAM block in RAMFS (less than RAM_COUNT or 32kByte)
  uint8_t            ram_crc8;       // Calculated CRC of stored data
} xRAMFSarray, * pRAMFSarray;

uint8_t ramfs_transfer_block( pRAMFSarray pRAMFS_block, uint8_t *data );

I used C and the freeRTOS platform because it is easiest for my environment, and I know it best. But, I’ll re-write it as a library in the Arduino IDE environment as needed. It won’t be too hard.

The client can use the XRAMFS malloc function to manage RAM allocation. A very simple malloc has been built, which can’t free XRAMFS. But, it can be simply ignored if desired and the command structure can be filled manually.
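
A malloc that never frees is essentially a bump allocator. The sketch below shows the idea only; the names, the start address of zero, and the single 32kByte page are assumptions for illustration and are not taken from the library.

```c
#include <stdint.h>

/* Hypothetical layout: one 32 kByte page, addressed as the Client sees it. */
#define XRAMFS_START  0x0000u
#define XRAMFS_SIZE   0x8000u   /* 32 kByte */

static uint16_t xramfs_next = XRAMFS_START;  /* first unallocated address */

/* Reserve size bytes and store the block address in *addr.
 * Returns 1 on success, 0 if the request is empty or the page is exhausted.
 * Nothing is ever freed. */
int xramfs_alloc(uint16_t size, uint16_t *addr)
{
    uint32_t used = (uint32_t)(xramfs_next - XRAMFS_START);

    if (size == 0 || (uint32_t)size > XRAMFS_SIZE - used)
        return 0;

    *addr = xramfs_next;
    xramfs_next = (uint16_t)(xramfs_next + size);
    return 1;
}
```

The returned address and the requested size are what would go into the address and size fields of the command structure.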

Initially, I implemented an interrupt driven semaphore system to manage the Yeah What? part of the bus protocol, but typically the Supervisor responds so quickly that the time to do several context swaps generated by the interrupt exceeded the time the Supervisor was prepared to wait. A simple wait loop keeps the Client on ready standby for 90µs so it can complete the transaction in the shortest time.

The Client code has no knowledge of where its XRAM is located on the Supervisor. Therefore the code is orthogonal and constant, irrespective of which Client is being used. This is a very useful feature where the author may not know in advance which ArduSat Client their code will be running upon.

Client application code should be written to make use of the Swap XRAMFS <-> RAM capability. This makes best use of the SPI bus features to combine Read and Write into one transaction, effectively doubling throughput over the Write plus Read combination.
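
Why Swap costs no more bus time than a Read or a Write alone can be seen from a small host-side model of the SPI exchange. This is a simulation written for illustration, with hypothetical function names; it does not touch real SPI hardware.

```c
#include <stdint.h>
#include <stddef.h>

/* One full-duplex SPI byte exchange, modelled on the host: the Master's
 * outgoing byte lands in the Slave's register, and the byte that was in
 * the Slave's register comes back to the Master. */
static uint8_t spi_exchange(uint8_t master_out, uint8_t *slave_reg)
{
    uint8_t master_in = *slave_reg;
    *slave_reg = master_out;
    return master_in;
}

/* Swap n bytes between XRAMFS (Master side) and Client RAM (Slave side).
 * Returns the number of byte exchanges used on the bus. */
size_t xramfs_swap(uint8_t *xram, uint8_t *client_ram, size_t n)
{
    size_t transfers = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t slave_reg = client_ram[i];   /* Client preloads its data register */
        xram[i] = spi_exchange(xram[i], &slave_reg);
        client_ram[i] = slave_reg;           /* Client stores what arrived */
        transfers++;
    }
    return transfers;
}
```

Swapping n bytes takes n exchanges, whereas a Write followed by a Read of the same block would take 2n.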

The user interface (monitor) is just for initial testing. I’ll have to write a load generation rig to find out what this baby can do, but that can wait for the next post. The logic analyser has captured the result of the > r (read) command in the below command line sequence. We can see the 20µs (now 30µs) Hey! on the Slave Select, 90µs pass before the acknowledgement bytes are swapped (only one cycle needed), 6 bytes of command structure are passed (Read command is 0x01), and then the data is read out of XRAMFS to the Client.

Terminal

Design Notes

The basis of every design: detailed functional specifications, hardware design, and user interface documentation. Oh, and scribbles much.

P1030072

Updates

I’ve updated the code on 22 February to remove some oversights in the Client main program, and added the OK check byte to the protocol. Code as usual on AVRfreeRTOS on Sourceforge.

Updated on 23 February to include some error checking on Supervisor side (preventing malicious Client requests), and on Client side preventing hang if the Supervisor is AWOL. Also removed the aggressive SPI timing utilising receive double buffering, as it often caused errors, and had no performance effect.

Initial performance measured is about 422kByte/s throughput for the swap function. Specifically 4.73825ms is needed for a complete 2048Byte data payload transaction (including sync, command, & OK timing). This also includes freeRTOS task swapping, as the Supervisor task is run with interrupts enabled in normal mode.
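
The throughput figure follows directly from those two numbers, assuming 1kByte is counted as 1024 bytes (my reading of the units; the helper name is hypothetical).

```c
/* Throughput in kByte/s, taking 1 kByte = 1024 bytes (an assumption). */
double throughput_kbyte_per_s(double payload_bytes, double elapsed_ms)
{
    return (payload_bytes / 1024.0) / (elapsed_ms / 1000.0);
}
```

Calling `throughput_kbyte_per_s(2048.0, 4.73825)` gives roughly 422.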

Have fixed some code issues on 4 March, mainly around a few µs delays required to let things run their course.

Now the platform is running stable with 4x Clients. A video is here

And here is a screenshot of the 4x terminals.

4xXRAMFS Client Monitors Screenshot

April 27th – I’ve uploaded the code for supporting a centralised SD Card on this platform to Sourceforge AVRfreeRTOS, and written about it at ArduSat SD Card Prototyping.

Windows 7 Starter with (up to) 128GB RAM

So most of my computers run a version of Ubuntu, Debian or Android. Used to be contrarian, but these days seems more devices are built on Linux kernels than almost any other type, so I’m just part of the mainstream.

Unfortunately, I also love the occasional First Person Shooter video game, or spend hours in an immersive Virtual Reality environment. It is an addiction, which I can mostly control. But sometimes, well, game on.

The one thing that Linux doesn’t do well is gaming. All the best games these days are released on the DX10 or DX11 platform on Microsoft Windows 7. So, like an alcoholic with a whiskey stash, I need to keep a Windows version stashed somewhere to appease the addiction.

Recently, I purchased a HP Netbook. The Windows 7 system it came loaded with was erased within 90 minutes of unboxing, and that was that.

When it came to rebuilding and upgrading my gaming machine, I thought why not repurpose that Windows 7 Starter licence as my DX11 gaming platform. A few hours later, with my shiny new Windows 7 Starter SP1 installed, validated, and perfectly legal, I found only two issues with the Windows 7 Starter SP1 operating system.

  1. No Aero. Being used to Compiz on Ubuntu, being stuck on one screen with few decorations seemed pretty last millennium. But since this is a gaming stash, that I will always be playing full screen, there is really no downside.
  2. Limit to 2GB RAM. The Windows 7 Starter version does not support more than 2GB of RAM, though other versions reportedly support up to 4GB. This was a problem for me, as my machine runs much more RAM, and I hate waste.

Microsoft provides this table describing the limits on X86 (32bit) Windows 7.

Version                  Limit on X86   Limit on X64
Windows 7 Ultimate       4 GB           192 GB
Windows 7 Enterprise     4 GB           192 GB
Windows 7 Professional   4 GB           192 GB
Windows 7 Home Premium   4 GB           16 GB
Windows 7 Home Basic     4 GB           8 GB
Windows 7 Starter        2 GB           2 GB

The dirty little secret

The secret that Microsoft doesn’t want to tell you is that the 4GB limit on 32-bit editions of Windows 7 is not due to any technical constraint on 32-bit operating systems. All the 32-bit editions of Windows 7 contain the code required for using physical memory above 4GB. Microsoft just doesn’t license you to use that code.

I sourced the information I’m quoting from Geoff Chappell’s web site, and I’ve found it to be completely true.

There are other resources on the Internetz and Torrentz that provide some patch code together with mysterious installers that may or may not address the memory limitation issue, too. But, since we’re dealing with the Kernel of my operating system, I could not be sure whether they would add any Trojans, Malware, or similar. So let’s stay away from that stuff.

Following the recipe

Following Geoff’s recipe to create a kernel is relatively simple. He lists all the things to do very clearly. I have extracted some of his words below, and modified them for my simple minded clarity.

The only executable that we’re going to touch is the PAE kernel, named NTKRNLPA.EXE, from 32-bit editions of Windows 7 SP1. The known builds have a routine named MxMemoryLicense in which there are two sequences of nearly identical code, one for each relevant license value. Each sequence calls the undocumented function ZwQueryLicenseValue and then tests for failure or for whether the data that has been read for the value is zero. In the known builds, each sequence has the following instructions in common:

Opcode Bytes   Instruction
7C xx          jl      default
8B 45 FC       mov     eax,dword ptr [ebp-4]
85 C0          test    eax,eax
74 yy          je      default

So the idea is to get A COPY of your NTKRNLPA.EXE and rename it ntkr128g.exe. This is the file you are going to patch. Use a Hex/Byte editor to search for the byte string 8B 45 FC 85 C0 74. I found approximately 25 occurrences of this byte string within the ntkr128g.exe file from my Windows 7 Starter SP1 system. You need to PATCH THE LAST TWO OCCURRENCES ONLY.

Both occurrences are to be patched the same way. The patch is designed to vary the ordinary execution as little as possible. The kernel is left to call ZwQueryLicenseValue as usual and to test for failure, but the last three of the above instructions are changed so that the kernel proceeds as if the retrieved data is the value that represents 128GB (which is the least value that removes licensing from the kernel’s computation of maximum physical address). Change the 7 bytes starting from 0x8B so that you now have the following instructions:

Opcode Bytes     Instruction
B8 00 00 02 00   mov     eax,00020000h
90               nop
90               nop

This means that you replace the last two occurrences of 8B 45 FC 85 C0 74 yy with B8 00 00 02 00 90 90 in the file. Great. You’re done. Well not really. You have to follow Geoff’s instructions to add a digital signature to the kernel, so that it can boot properly.
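
As a sanity check on the replacement bytes, the immediate in the new mov instruction can be decoded and converted. This small sketch assumes the licence value counts megabytes, which is consistent with the 128GB figure quoted above; the function names are hypothetical.

```c
#include <stdint.h>

/* Decode the 32-bit little-endian immediate that follows the B8 opcode
 * (mov eax, imm32). */
uint32_t mov_eax_imm32(const uint8_t insn[5])
{
    return (uint32_t)insn[1]
         | ((uint32_t)insn[2] << 8)
         | ((uint32_t)insn[3] << 16)
         | ((uint32_t)insn[4] << 24);
}

/* Assumes the licence value is expressed in megabytes. */
uint32_t licence_value_to_gb(uint32_t megabytes)
{
    return megabytes / 1024u;
}
```

Fed the bytes B8 00 00 02 00, the first function returns 00020000h (131072), and the second turns that into 128.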

To add the digital signature, you need to download the Microsoft Software Development Kit, and install the tools (only tools required) to get access to the certification and signing tools to enable the kernel to boot. I did it. It is not hard. Just a little time consuming.

In Test Mode, the loader relaxes its integrity checking such that any root certificate is accepted. For suitable tools, with documentation, look in either the Windows Software Development Kit (SDK) or the Windows Driver Kit (WDK). To make your own certificate, run some such command as

makecert -r -ss my -n "CN=On My Authority"

This creates a root certificate for an invented certification authority named “On My Authority” and installs it in the Personal certificate store, which is represented by “my” in the command. You can view the new certificate by starting the Certificate Manager (CERTMGR.MSC), which also lets you set a Friendly Name for the certificate if you want to keep it. To sign your modified kernel with this certificate, run the command

signtool sign -s my -n "On My Authority" ntkr128g.exe

If you want to save some time, you can get a signed kernel here.

Once you have a kernel, the rest is pretty straightforward. You need to use the bcdedit command to create an alternative boot entry, following Geoff’s instructions.

bcdedit /copy {current} /d "Windows 7 128GB"

Interesting options to set on the new boot entry (having put the patched kernel back in the same directory as the original NTKRNLPA.EXE, of course) are below.

{guid} refers to the id of the BCD entry that you’re editing.

bcdedit /set {guid} kernel ntkr128g.exe
bcdedit /set {guid} testsigning on
bcdedit /set {guid} pae ForceEnable
bcdedit /set {guid} increaseuserva 3072

These instructions tell the BCD boot loader to:

  • Load the ntkr128g kernel.
  • Ignore that Microsoft has not signed it. Treat it as a test kernel.
  • Force on PAE (Physical Address Extension). Which is irrelevant really, as it is automatically turned on when DEP (Data Execution Prevention) is enabled, which is the default case for Windows 7.
  • Increase the maximum amount of memory that a single application can address to 3072MB (up from 2048MB).

Update

On December 13 2011 Microsoft released an advisory that updates the kernel.

The above kernel has been updated to reflect this change and is now version 6.1.7601.17713.

Everything is working as usual.

Update

On April 10 2012 Microsoft released an advisory that updates the kernel.

The above kernel has been updated to reflect this change and is now version 6.1.7601.17790.

Everything is working as usual.

Update

In August (around the 16th) Microsoft updated the kernel.

The above kernel has been updated to reflect this change and is now version 6.1.7601.17803.

Everything is working as usual.

Update

Around 24th October 2012 Microsoft released an advisory that updates the kernel.

The above kernel has been updated to reflect this change and is now version 6.1.7601.17944.

Everything is working as usual.

Update

Around 13th April 2013 Microsoft updated the kernel.

Update

Around 2nd November 2013 Microsoft updated the kernel.

Right click and “Save link as…”  ntkr128g

The above kernel has been updated to reflect this change and is now version 6.1.7601.18247.

Everything is working as usual.

Update

Around June 2014, I converted my machine to Windows 7 x64. Blame Titanfall requiring x64. So, sorry, I’m not maintaining this kernel any further. As of June 2014 everything was working as normal.

Presto up to 128GB RAM

Once this kernel is booted, you will note that the screen comes up with a small note in the lower right corner of the Desktop that Windows 7 is in “Test Mode”, which is to note that you’re testing your own kernel. Well great. Thanks.

Test_mode

But the good news can be seen on the Resource Monitor, with 8GB of RAM showing.

Resource_monitor

That’s it. Testing with real games (e.g. Battlefield 3) shows that over 2.3GByte can be allocated to one application. Game on!