90M Falcon Runs on 2014 Raspberry Pi
the gist
Cross-compiled llama.cpp for the ARMv6 Pi 1, used Raspberry Pi OS Lite and `--no-mmap`; 4-bit/8-bit quantized Falcon-H1-Tiny 90M produces coherent responses at ~1 token every 3 seconds.
The Breakthrough
The author cross-compiled llama.cpp for ARMv6 with no NEON, OpenMP, or shared libraries; ran 4-bit and 8-bit quantized Falcon-H1-Tiny-90M-Instruct on a 2014 Raspberry Pi (512MB RAM, single-core 700MHz) under Raspberry Pi OS Lite with the `--no-mmap` flag; and got coherent responses to prompts.
What Actually Worked
- Flashed Raspberry Pi OS Lite (32-bit) with Wi-Fi/SSH preconfigured to minimize idle memory usage and enable remote management.
- Cloned the llama.cpp source and used dockcross on a Mac (ARMv8) to cross-compile for ARMv6-VFP: `cmake .. -DLLAMA_CROSS=ON -DCMAKE_TOOLCHAIN_FILE=$HOME/dockcross/linux-armv6.cmake -DBUILD_SHARED_LIBS=OFF -DLLAMA_NEON=OFF -DLLAMA_OPENMP=OFF`; built the `./llama-cli` binary in about 2 minutes and SCP'd it to the Pi.
- Downloaded the legacy Q2_K/Q4_0/Q8_0 GGUF models from https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF into `/models` on the Pi.
- Ran inference with `./llama-cli -m falcon-h1-tiny-90m-instruct-q4_0.gguf -p "hello how are you" -n 32 -t 1 -c 128 --no-mmap` (single thread, 128-token context, no mmap to avoid 32-bit address-space fragmentation).
- Falcon-H1-Tiny uses a hybrid Transformer + Mamba architecture for compactness at 90M params.
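The build-and-deploy steps above, consolidated into a runnable sketch. The dockcross helper-script pattern and the Pi's hostname are assumptions, and the `LLAMA_*` cmake flags are reproduced as reported in the video rather than verified against current llama.cpp (recent versions spell some of these `GGML_*`):

```bash
# On the Mac host: fetch llama.cpp and the dockcross ARMv6 toolchain wrapper.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
docker run --rm dockcross/linux-armv6 > ./dockcross-armv6  # image emits a helper script
chmod +x ./dockcross-armv6

# Configure and build inside the cross-toolchain container.
# Flags as reported: static libs, no NEON (ARMv6 has none),
# no OpenMP (pointless on a single core).
./dockcross-armv6 bash -c '
  cmake -B build -DLLAMA_CROSS=ON -DBUILD_SHARED_LIBS=OFF \
        -DLLAMA_NEON=OFF -DLLAMA_OPENMP=OFF &&
  cmake --build build --target llama-cli -j
'

# Copy the static binary to the Pi (hostname assumed).
scp build/bin/llama-cli pi@raspberrypi.local:~/
```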
Before / After
- 2-bit (Q2_K): generates ~1 token every 3 seconds but produces incoherent nonsense (e.g., a garbled response to "hello how are you").
- 4-bit (Q4_0): coherent greeting response.
- 8-bit (Q8_0): correctly answers "capital of Belgium" (Brussels) but errs on "capital of Albania" (says "Tana"; actually Tirana), showing knowledge gaps on obscure topics due to the parameter limit.
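A minimal sketch for reproducing this comparison in one pass, assuming all three GGUF files were downloaded into `/models` and follow the repo's naming pattern (only the q4_0 filename is confirmed by the source):

```bash
# Run the same prompt against each quantization and wall-clock it.
cd /models
for q in q2_k q4_0 q8_0; do
  echo "=== ${q} ==="
  time ~/llama-cli -m "falcon-h1-tiny-90m-instruct-${q}.gguf" \
    -p "What is the capital of Belgium?" -n 32 -t 1 -c 128 --no-mmap
done
```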
Context
The author started from the first-gen Raspberry Pi's constraints: an ARMv6 core lacking NEON instructions, 512MB RAM prone to mmap fragmentation, and a single 700MHz core. The experiment tested whether a 90M-parameter LLM could run locally by quantizing to the legacy Q4/Q8 formats (avoiding IQ2 quants, which need a modern CPU), cross-compiling llama.cpp, and stripping OS bloat. The result shows that tiny hybrid-architecture models can bring edge AI to vintage hardware, though at ~0.3 tok/s it is far too slow for production.
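A back-of-envelope check of why the weights fit easily in 512MB yet mmap still misbehaves; the bits-per-parameter figures are llama.cpp's standard block encodings (Q4_0 ≈ 4.5 bits/param, Q8_0 ≈ 8.5), and the fragmentation explanation is the video's:

```bash
params=90000000
echo "Q4_0 weights: $(( params * 45 / 10 / 8 / 1024 / 1024 )) MB"  # ~48 MB
echo "Q8_0 weights: $(( params * 85 / 10 / 8 / 1024 / 1024 )) MB"  # ~91 MB
# Even Q8_0 plus a 128-token KV cache fits comfortably in 512MB of RAM,
# but the 32-bit address space can be too fragmented to mmap one large
# contiguous region, hence --no-mmap (malloc and read the file instead).
```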
Notable Quotes
- "On a 90 million parameter model the weights are so compressed that the linguistic logic has basically collapsed it's barely coherent."
- "Now we get a coherent greeting back so that is a success we now have an actual AI model running locally on the Pi."
- "The 90 million parameter crunch comes with its own cost it might have accurate knowledge about larger more popular countries but lacks knowledge about lesser known countries."
Content References
- Model repo with the legacy Q2/Q4/Q8 GGUF files (https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF); used for the inference tests.