NES AI - AI running on Nintendo NES

I was at a comic convention, surrounded by the buzz of the crowd, stands full of colorful gadgets, and cosplayers. In the middle of it all there was that special corner, a little nostalgic and a little magical: the retrogaming area. It was there, in front of a flickering CRT displaying Super Mario Bros., that an idea was born.

As I looked at those pixelated images on the screen, I wondered, “What if we could bring artificial intelligence, such a hot topic nowadays, to this 40-year-old hardware?”. I’m not talking about a simple if-else script that simulates an enemy, but a real Neural Network capable of seeing and recognizing images.

I did some research on Google and came across Neurona, a project for the Arduino UNO that attempted to run a MultiLayer Perceptron (MLP) model on hardware with limited resources (ATmega328p, AVR architecture, 16MHz, 2KB RAM). However, there were some major unknowns: the ATmega328p has a hardware multiply instruction, which the MOS 6502 (the NES CPU) lacks, and the project had only been tested with a few dozen parameters. Could it be optimized to the point where a more complex model runs on even less powerful hardware?

I started by developing everything to run on an emulator, but I realized that this wasn’t enough to make the challenge truly authentic. I bought an original NES console, prepared a flashable cartridge (Everdrive N8), and decided: this code must run on real 1980s hardware.

Welcome to NES AI.

The challenge

Developing for the Nintendo Entertainment System (NES) is not like programming on a modern PC. It belongs to another geological era of computing.

The heart of the NES is a specific CPU, the Ricoh 2A03 (or Ricoh 2A07 for the European PAL version). It’s based on the legendary MOS 6502 (the same processor that powered the Commodore 64 and Apple II), but modified specifically for Nintendo: for example, the BCD decimal mode was removed and an Audio Processing Unit (APU) was integrated on the same chip, while graphics are handled by a separate Picture Processing Unit (PPU), an ancestor of the modern GPU.

Although they share the same basic architecture, there is a huge difference between the platforms: while the Commodore 64 has 64KB of RAM, the NES has just 2KB of internal RAM. Yes, you read that right: 2048 bytes to manage the entire game state, the stack and, in our case, a Neural Network.

Nowadays, the specifications are laughable:

6502 Instruction Set

The mirage of Go

Initially, driven by enthusiasm, I wanted to write this project in Go, since I use it as my main language as a freelancer. I hoped that TinyGo, a Go compiler for embedded systems, would support the 6502 chip, but unfortunately this wasn’t the case.

So I started looking for alternatives and found an interesting project called nesgo, a compiler that promised to translate Go code into NES executables. It parses the Go files of the project being compiled, builds its own AST (Abstract Syntax Tree), and converts the result into 6502 Assembly code.

Unfortunately, I had to discard it too. Support was still limited for what I needed to do, there were few options for writing robust code (e.g. the only variable types are int8, uint8, and uint16) and, most importantly, as a developer, I had no way of writing code that the compiler could easily optimize. With such scarce resources, abstraction is a luxury we can’t afford. Perhaps I should have just used Assembly, but I didn’t think it was necessary, so I opted for C, which still gives me low-level control but is much more convenient to use (and the compiler can optimize the code).

I learned that one of the few options for C development on the 6502 is the cc65 compiler. It’s a mature and robust suite, but it doesn’t overcome some of the machine’s limitations: there is no support for the float, double, and uint64_t types, and it doesn’t implement all the features of standard C. The alternative is llvm-mos, but I haven’t looked into it enough to have an opinion.

So I refreshed my knowledge of the C language and started to work on it.

Compiler optimizations

In the build toolchain, I enabled the maximum optimization flags: -Oisr:

-O (Optimize code): Apply general optimizations
-i (Inline more code): Additionally inlines some runtime functions
-s (Inline standard functions): Additionally inlines some known standard C functions
-r (Enable register variables): Allows the compiler to use zero page registers (fast RAM in the first 256 addresses) to store the most frequently used local variables
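As an illustration of what -r enables: cc65 honors the C register keyword for frequently used locals and places them in zero page. A hypothetical helper (not from the project) might look like this:

```c
/* With -Or, cc65 can keep `register` locals in zero page, the 6502's
   fastest RAM. Pointers benefit the most from this treatment. */
unsigned char sum_bytes(register const unsigned char *p, unsigned char len) {
    unsigned char i, total = 0;
    for (i = 0; i < len; ++i)
        total += p[i];
    return total;
}
```

On a modern compiler the keyword is merely a hint, but on cc65 it directly affects which memory the variable lives in.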

Hardware interface

Writing C code for the 6502 is only half the job. The other half is talking to the specific hardware of the NES: the PPU (Picture Processing Unit) for graphics, the APU (Audio Processing Unit) for sound, and the gamepad for player input.

These components are mapped in memory at specific addresses (e.g., $2000 - $2007 for the PPU registers) and require precise communication protocols, strict timing (such as writing to VRAM only during the vertical blanking interval, V-Blank), and interrupt handling. Doing all this by hand in C or Assembly for every single pixel or sprite would be a nightmare. This is where neslib comes in, a library originally written by Shiru (a pioneer of NES homebrew).

neslib is the bridge between high-level logic and the bare-metal hardware, providing a lightweight but powerful abstraction for common operations:

ppu_on_all() / ppu_off(): Turns rendering on and off
vram_write(): Writes data to video memory by accessing PPU registers
pad_trigger(): Reads controller input by dealing with hardware polling

Without neslib, I would have had to write hundreds of lines of Assembly code just to display the result on the screen.

Fixed point math

Without floating point types, how do we represent the weights of a Neural Network (which are typically small numbers such as 0.1234 or -0.5678)? We can use fixed point math, which represents fractional values with plain integers, to perform the operations we need in software.

I evaluated several libraries, including fptc-lib, but I had to discard it because it requires the uint64_t type for intermediate calculations. The cc65 compiler supports uint32_t at most (and even performing 32 bit calculations on an 8 bit CPU is slow). Furthermore, fptc-lib doesn’t have specific optimizations for 8 bit CPUs.

The final choice was libfixmath, which uses the Q16.16 format: 16 bits for the integer part and 16 bits for the fractional part, stored in a single 32 bit variable (fix16_t).

I forked the original library as libfixmath-cc65 and corrected the various errors reported by the compiler (for example, the types int64_t, float, and double appeared in unused functions, but they aren’t supported by our compiler). After fixing them all, I had a version of the library compatible with cc65.

Q16.16 limitations

This conversion isn’t free.

With Q16.16 we can represent numbers from approximately -32768 to +32768, with a minimum step of 1/65536 (~0.000015).

Fixed Point Representation
Fixed Point Binary Representation

With our fix16_t quantized model we will never achieve the same precision as a standard float32, but this is an acceptable compromise for our purposes.

Here is a practical example of the precision loss we have to accept:

An original weight of -0.12657789885997772 (in float32), converted to Q16.16 format, becomes the hexadecimal value 0xFFFFDF99. The decimal representation of this value is -0.126572. Basically, we have lost the digits beyond the fifth decimal place.
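The round trip can be reproduced in a few lines of portable C (a desktop sketch, not project code; in the real pipeline this conversion happens offline, since cc65 has no float support):

```c
#include <stdint.h>

typedef int32_t fix16_t;

/* float -> Q16.16: scale by 2^16 and round to the nearest representable step */
fix16_t fix16_from_float(float f) {
    return (fix16_t)(f * 65536.0f + (f >= 0.0f ? 0.5f : -0.5f));
}

/* Q16.16 -> float: divide the raw integer by 2^16 */
float fix16_to_float(fix16_t x) {
    return (float)x / 65536.0f;
}
```

Feeding the weight above through fix16_from_float yields the raw value 0xFFFFDF99, and fix16_to_float reads it back as ~-0.126572, matching the precision loss described.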

Linker configuration

In the linker configuration file nes.cfg (which is mandatory: without it, we get a compilation error) I had to carefully define each segment of the final .nes file.

A key point of the configuration is the software stack. The cc65 compiler, in addition to the processor hardware stack (which is always 256 bytes), includes a software stack to manage C local variables and function parameters.

I configured this stack (__STACKSIZE__) at “only” 256 bytes ($0100), which is enough to run the program correctly without any strange behavior caused by stack overflows, likely because few of my C functions push values onto the software stack. Together, the two stacks consume 1/4 of the total RAM.
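For reference, the stack size is set through a symbol in the linker config; a minimal excerpt of what the relevant nes.cfg section might look like (cc65 syntax, value as described above):

```
SYMBOLS {
    # 256-byte C software stack, on top of the 256-byte 6502 hardware stack
    __STACKSIZE__: type = weak, value = $0100;
}
```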

NES Runtime Memory

The NES processor can only address 64KB of total memory. Out of this, only 32KB (from addresses $8000 to $FFFF) are reserved for the game ROM (PRG-ROM).

Larger games (such as Super Mario Bros. 3, 384KB) bypass this limitation by using special chips inside the cartridge called mappers (or Memory Management Controllers, MMC). These chips use a technique called bank switching: the console only sees 32KB at a time, but the chip can instantly swap that window onto another 32KB block of data, as if leafing through the pages of a book; together, the banks form a single larger game (up to several MB).

When it was time to pick a cartridge type, I went with the simplest mapper possible: NROM (mapper 0), thus avoiding bank switching. This way, the CPU is connected directly to the ROM memory with nothing in the middle. However, this forces us to fit everything (executable code, graphics libraries, neural network logic, and thousands of model weights) into a single 32KB block.

NROM is the same type of cartridge used by early NES games, including the original Super Mario Bros. In addition to the 32KB reserved for the code (PRG-ROM), there are also 8KB for graphic sprites (CHR-ROM).

Neural Network

We certainly can’t run a model like GPT-4 (~1.7 trillion parameters) on an 80s console. We needed something small and efficient. I decided to tackle the classic “Hello World” problem of Machine Learning, using the MNIST dataset (60,000 images of handwritten digits, 28x28 pixels). The TensorFlow docs recommend approaching it with an MLP Neural Network with 2 hidden layers of 700 and 500 neurons. Such a model, with its ~390,000 float32 parameters, occupies ~1.49MB.

I had to simplify the architecture of the neural network to fit it into the very limited space of a NES executable:

The result? The number of parameters dropped from 390,510 to just 3,900. Considering 4 bytes for each weight, the total size of the network weights has been reduced to 15.23KB, 100 times less, and it’s ~1/2 of the total space available in the ROM.
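The arithmetic checks out if we assume a 49-pixel (7x7) input, which matches the input grid shown later in the article, feeding hidden layers of 49 and 24 neurons and a 10-neuron output, each neuron carrying one bias:

```c
/* Parameters per dense layer = outputs * (inputs + 1 bias).
   Layer sizes are from the article; the 7x7 input is an assumption. */
enum {
    NN_L1_PARAMS    = 49 * (49 + 1),                           /* 2450 */
    NN_L2_PARAMS    = 24 * (49 + 1),                           /* 1200 */
    NN_OUT_PARAMS   = 10 * (24 + 1),                           /*  250 */
    NN_TOTAL_PARAMS = NN_L1_PARAMS + NN_L2_PARAMS + NN_OUT_PARAMS, /* 3900 */
    NN_WEIGHT_BYTES = NN_TOTAL_PARAMS * 4                      /* 15600 bytes */
};
```

15,600 bytes is exactly the ~15.23KB quoted above.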

Does this make our AI a "toy"? Maybe, but our plan is to get it up and running, it doesn’t need to be perfect.


Training & Conversion

I created a Python script (training.py) that uses TensorFlow / Keras to train this mini Neural Network.

The interesting part, however, is in the weights export process. The Python script doesn’t just save the model, but also acts as a transpiler for the weights. Since cc65 doesn’t support floating point operations even in the preprocessing phase, we can’t use the F16(1.23) macro in the C code, because the compiler would not know how to interpret 1.23.

This is what the script does:

  1. Takes float32 weights trained by TensorFlow.
  2. Converts them to Q16.16 format (by multiplying by 65536), exactly as the macro does.
  3. Casts them to uint32 using NumPy to obtain the raw hex representation.
  4. Generates a C file (weights.c) containing an array of uint32_t / fix16_t constants ready to be saved in the ROM.
const fix16_t WEIGHTS[] = {
    0xFFFFDF99, // = -0.12657...
    0x00002D61, // = 0.17725...
    // ... another 3898 weights ...
};

Similarities to LLMs

This project may seem trivial compared to modern Large Language Models (LLMs) such as GPT, but the basic idea is surprisingly similar. Even a trillion-parameter giant like GPT-4 internally uses Feed Forward Networks (FFNs) that are very similar to our small MLP.

Transformer architecture

In a Transformer (GPT’s architecture), each Attention Block is followed by a Feed Forward network. This network takes the information processed by the Attention Block and passes it through layers of neurons to "reason", just like our MLP processes pixels. The differences are in scale and complexity, but the concept of multiplying inputs by weights and applying activation functions (e.g. ReLU or one of its modern variants such as GeLU) remains a cornerstone of Artificial Intelligence.

C Inference Engine

For the core of the project, I had to write an optimized inference engine from scratch in the mlp.c file.

The weights are stored in a huge const array in ROM. They are never loaded into RAM because the 6502 has a unified 16 bit address space and the processor can read them directly.

Instead of using a multidimensional array to retrieve the right weight (e.g. weights[layer][neuron][input]), I preferred to organize the weights sequentially. A pointer (weights_cursor) starts at the beginning of the array and is simply incremented as the calculations proceed.


No malloc

It’s impossible to dynamically allocate memory on the NES. The heap doesn’t exist, so every call to malloc() returns NULL. To save intermediate results (the output of one layer that becomes the input of the next), I statically allocated two arrays (_layer1_output and _layer2_output) in the BSS section of RAM. This completely avoids using malloc and the stack for heavy data.
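A minimal sketch of that layout (the array names come from the article; the sizes follow the layer widths given below, 49 and 24 outputs):

```c
#include <stdint.h>

typedef int32_t fix16_t;  /* Q16.16, as provided by libfixmath */

/* Statically allocated in BSS: zero-initialized at startup, no heap,
   no malloc. Each layer reads one buffer and writes into the other. */
static fix16_t _layer1_output[49];
static fix16_t _layer2_output[24];
```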

Dense layer

The dense layer (also called Fully Connected layer) is the fundamental part of an MLP model. Its function is equivalent to y = activation(dot(input, weights) + bias). A dense layer produces multiple outputs, as many as the model defines for that layer. In our case, hidden layer 1 has 49 outputs, hidden layer 2 has 24, and the output layer has 10.

In our C code, I implemented the dense layer logic in such a way as to minimize register usage. We iterate over the outputs, and for each one we perform these calculations:

  1. The first value read from the weights array is the bias. This is immediately loaded into the sum variable, which will eventually contain the total sum of the values that will be passed to the activation function.
  2. Next, we iterate over all the inputs. We multiply each input by the correct weight read from the ROM and add the result of each multiplication to sum.
  3. We apply the activation function (e.g. ReLU) to the variable sum and write the result as the output value of the current iteration.
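The three steps above can be sketched as follows. This is a simplified stand-in, not the project’s exact mlp.c: fix16_mul and fix16_sadd are reimplemented here with a 64-bit intermediate for clarity, which the real cc65 port avoids.

```c
#include <stdint.h>

typedef int32_t fix16_t;

/* Simplified stand-ins for the libfixmath operations used by the engine */
fix16_t fix16_mul(fix16_t a, fix16_t b) {
    return (fix16_t)(((int64_t)a * (int64_t)b) >> 16);
}
fix16_t fix16_sadd(fix16_t a, fix16_t b) {  /* saturating add */
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (fix16_t)s;
}
fix16_t relu(fix16_t x) { return x < 0 ? 0 : x; }

/* One dense layer: for each output, read the bias first, then one weight
   per input, advancing a sequential cursor through the ROM weight array. */
void dense_layer(const fix16_t **weights_cursor,
                 const fix16_t *input, uint8_t n_inputs,
                 fix16_t *output, uint8_t n_outputs)
{
    uint8_t o, i;
    for (o = 0; o < n_outputs; ++o) {
        fix16_t sum = *(*weights_cursor)++;            /* 1. bias */
        for (i = 0; i < n_inputs; ++i)                 /* 2. dot product */
            sum = fix16_sadd(sum, fix16_mul(input[i], *(*weights_cursor)++));
        output[o] = relu(sum);                         /* 3. activation */
    }
}
```

With the project’s real layers, the same function would be called three times in sequence, reusing the cursor across calls.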

Activation functions

ReLU

Simple, but effective. When the input is less than 0, it returns 0; otherwise it passes the input through unchanged, making 0 the minimum output. We use it as the activation function in the hidden layers.

Stable Softmax

Transforms the set of raw numbers (called logits) returned by the model into probabilities. In its output, all values are between 0 and 1, and the sum of all values is exactly 1 (100%). It’s used in the output layer to obtain readable results.

Overflow prevention

One of the trickiest challenges when working with fixed point numbers (Q16.16) on an 8 bit system is overflow. The range of values is limited to roughly -32768 to +32768. That doesn’t seem small for our use case, but there is a chance, even if minimal, of exceeding it, for example when adding many large values together.

Once the limit is exceeded, the result wraps around: a very positive value suddenly becomes very negative, making our AI useless because it would give inconsistent results.

I have adopted 2 fundamental strategies to eliminate this risk.

Saturated arithmetic

Instead of a normal sum, I used the saturated sum function available in libfixmath (fix16_sadd). If the result of an addition would exceed the maximum representable value and therefore overflow, it is clamped to that maximum instead of wrapping around to a negative number.

Stable Softmax

The normal Softmax function uses exponential functions, which grow very quickly. I implemented the Stable variant: before calculating exp(x), I subtract the maximum value in the input array from each input (x[i] - max(x)). This guarantees that all exponents are <= 0, producing only results between 0 and 1 and thus avoiding any possible overflow.

Conclusions

Once I had compiled everything and obtained the final .nes file, I checked the space occupied in the ROM file: to my surprise (and relief), there were still ~5KB free in the PRG-ROM. This means that not only was the space challenge overcome, but there is still room to add game logic, music, or perhaps a graphical interface for drawing the input image directly with the gamepad.

The experience of seeing something I wrote running on an old physical console, 40 years after its release date… well, this was the real victory. The execution takes ~20.5 seconds to complete and display the result on the screen.

Running NES AI
NES AI recognized the image of a 1 on a physical NES console

Some numbers are recognized much more easily than others. For example, by manually placing pixels in the input (range [0,10]) and then normalizing them for the model into the range [0,1] (e.g., 1 = 0.1), I achieved a 100% probability for the value 0 with a drawing of a zero: it was fully recognized.

static fix16_t _data_input[MLP_INPUT_SIDE][MLP_INPUT_SIDE] = {
    {0, 0, 0, 0, 0, 0, 0},
    {0, 0, 8, 8, 8, 0, 0},
    {0, 0, 8, 0, 8, 0, 0},
    {0, 0, 8, 0, 8, 0, 0},
    {0, 0, 8, 0, 8, 0, 0},
    {0, 0, 8, 8, 8, 0, 0},
    {0, 0, 0, 0, 0, 0, 0}
};
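The normalization step described above (mapping the hand-placed [0,10] values into [0,1] as Q16.16) can be sketched like this; MLP_INPUT_SIDE is assumed to be 7, matching the grid:

```c
#include <stdint.h>

typedef int32_t fix16_t;
#define MLP_INPUT_SIDE 7  /* assumed from the 7x7 grid above */

/* Map raw cell values in [0,10] to Q16.16 values in [0,1]:
   value/10 in Q16.16 is value * 65536 / 10. */
void normalize_input(fix16_t grid[MLP_INPUT_SIDE][MLP_INPUT_SIDE]) {
    uint8_t r, c;
    for (r = 0; r < MLP_INPUT_SIDE; ++r)
        for (c = 0; c < MLP_INPUT_SIDE; ++c)
            grid[r][c] = (fix16_t)((grid[r][c] * 65536L) / 10);
}
```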

On the other hand, I tried drawing a 7 in various ways, but the model never recognized it, claiming it was a 3 or a 5. Clearly, the output can’t always be correct.

NES AI is an open source project available on GitHub, under the MIT license. Feel free to fork, break, and improve it.