90M Falcon Runs on 2014 Raspberry Pi
the gist
Cross-compiled llama.cpp for the ARMv6 Pi 1, used Raspberry Pi OS Lite and `--no-mmap`; 4-bit/8-bit quantized Falcon-H1-Tiny 90M produces coherent responses at ~1 token every 3 seconds.
The Breakthrough
The author cross-compiled llama.cpp for ARMv6 with no NEON, OpenMP, or shared libraries; ran 4-bit and 8-bit quantized Falcon-H1-Tiny-90M-Instruct on a 2014 Raspberry Pi (512MB RAM, single-core 700MHz) under Raspberry Pi OS Lite with the `--no-mmap` flag; and got coherent responses to prompts.
What Actually Worked
- Flashed Raspberry Pi OS Lite (32-bit) with Wi-Fi/SSH preconfigured to minimize idle memory usage and enable remote management.
- Cloned the llama.cpp source and used dockcross on a Mac (ARMv8) to cross-compile for ARMv6-VFP: `cmake .. -DLLAMA_CROSS=ON -DCMAKE_TOOLCHAIN_FILE=$HOME/dockcross/linux-armv6.cmake -DBUILD_SHARED_LIBS=OFF -DLLAMA_NEON=OFF -DLLAMA_OPENMP=OFF`; built the `./llama-cli` binary in about 2 minutes and SCP'd it to the Pi.
- Downloaded the legacy Q2_K/Q4_0/Q8_0 GGUF models from https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF into `/models` on the Pi.
- Ran inference with `./llama-cli -m falcon-h1-tiny-90m-instruct-q4_0.gguf -p "hello how are you" -n 32 -t 1 -c 128 --no-mmap` (single thread, 128-token context, no mmap to avoid 32-bit address-space fragmentation).
- Falcon-H1-Tiny uses a hybrid Transformer + Mamba architecture for compactness at 90M params.
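The build-and-deploy steps above, consolidated into a runnable sketch. The dockcross helper-script pattern and the Pi's hostname are assumptions, and the `LLAMA_*` cmake flags are reproduced as reported in the video rather than verified against current llama.cpp (recent versions spell some of these `GGML_*`):

```bash
# On the Mac host: fetch llama.cpp and the dockcross ARMv6 toolchain wrapper.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
docker run --rm dockcross/linux-armv6 > ./dockcross-armv6  # image emits a helper script
chmod +x ./dockcross-armv6

# Configure and build inside the cross-toolchain container.
# Flags as reported: static libs, no NEON (ARMv6 has none),
# no OpenMP (pointless on a single core).
./dockcross-armv6 bash -c '
  cmake -B build -DLLAMA_CROSS=ON -DBUILD_SHARED_LIBS=OFF \
        -DLLAMA_NEON=OFF -DLLAMA_OPENMP=OFF &&
  cmake --build build --target llama-cli -j
'

# Copy the static binary to the Pi (hostname assumed).
scp build/bin/llama-cli pi@raspberrypi.local:~/
```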
Before / After
- 2-bit (Q2_K): generates ~1 token every 3 seconds but produces incoherent nonsense (e.g., a garbled response to "hello how are you").
- 4-bit (Q4_0): coherent greeting response.
- 8-bit (Q8_0): correctly answers "capital of Belgium" (Brussels) but errs on "capital of Albania" (says "Tana"; actually Tirana), showing knowledge gaps on obscure topics due to the parameter limit.
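A minimal sketch for reproducing this comparison in one pass, assuming all three GGUF files were downloaded into `/models` and follow the repo's naming pattern (only the q4_0 filename is confirmed by the source):

```bash
# Run the same prompt against each quantization and wall-clock it.
cd /models
for q in q2_k q4_0 q8_0; do
  echo "=== ${q} ==="
  time ~/llama-cli -m "falcon-h1-tiny-90m-instruct-${q}.gguf" \
    -p "What is the capital of Belgium?" -n 32 -t 1 -c 128 --no-mmap
done
```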
Context
The author started from the first-gen Raspberry Pi's constraints: an ARMv6 core lacking NEON instructions, 512MB RAM prone to mmap fragmentation, and a single 700MHz core. The experiment tested whether a 90M-parameter LLM could run locally by quantizing to the legacy Q4/Q8 formats (avoiding IQ2 quants, which need a modern CPU), cross-compiling llama.cpp, and stripping OS bloat. The result shows that tiny hybrid-architecture models can bring edge AI to vintage hardware, though at ~0.3 tok/s it is far too slow for production.
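A back-of-envelope check of why the weights fit easily in 512MB yet mmap still misbehaves; the bits-per-parameter figures are llama.cpp's standard block encodings (Q4_0 ≈ 4.5 bits/param, Q8_0 ≈ 8.5), and the fragmentation explanation is the video's:

```bash
params=90000000
echo "Q4_0 weights: $(( params * 45 / 10 / 8 / 1024 / 1024 )) MB"  # ~48 MB
echo "Q8_0 weights: $(( params * 85 / 10 / 8 / 1024 / 1024 )) MB"  # ~91 MB
# Even Q8_0 plus a 128-token KV cache fits comfortably in 512MB of RAM,
# but the 32-bit address space can be too fragmented to mmap one large
# contiguous region, hence --no-mmap (malloc and read the file instead).
```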
Notable Quotes
- "On a 90 million parameter model the weights are so compressed that the linguistic logic has basically collapsed it's barely coherent."
- "Now we get a coherent greeting back so that is a success we now have an actual AI model running locally on the Pi."
- "The 90 million parameter crunch comes with its own cost it might have accurate knowledge about larger more popular countries but lacks knowledge about lesser known countries."
Content References
- Model repo with the legacy Q2/Q4/Q8 GGUF files (https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF); used for the inference tests.