90M Falcon Runs on 2014 Raspberry Pi

Better Stack — go watch the original →

Cross-compiled llama.cpp for the ARMv6 Pi 1, running Raspberry Pi OS Lite with --no-mmap; 4-bit/8-bit Falcon-H1-Tiny 90M produces coherent responses at roughly 1 token every 3 seconds.

The Breakthrough

The author cross-compiled llama.cpp for ARMv6 (no NEON, no OpenMP, no shared libraries) and ran 4-bit and 8-bit quantized Falcon-H1-Tiny-90M-Instruct on a 2014 Raspberry Pi (512MB RAM, single 700MHz core) under Raspberry Pi OS Lite with the --no-mmap flag, generating coherent responses to prompts.

What Actually Worked

  • Flashed Raspberry Pi OS Lite (32-bit) with Wi-Fi/SSH preconfigured to minimize idle memory usage and enable remote management.
  • Cloned the llama.cpp source and used dockcross on an ARMv8 Mac to cross-compile for ARMv6 with VFP: cmake .. -DLLAMA_CROSS=ON -DCMAKE_TOOLCHAIN_FILE=$HOME/dockcross/linux-armv6.cmake -DBUILD_SHARED_LIBS=OFF -DLLAMA_NEON=OFF -DLLAMA_OPENMP=OFF; built the ./llama-cli binary in about 2 minutes, then copied it to the Pi via SCP.
  • Downloaded legacy Q2_K/Q4_0/Q8_0 GGUF models from https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF into /models on Pi.
  • Ran inference with ./llama-cli -m falcon-h1-tiny-90m-instruct-q4_0.gguf -p "hello how are you" -n 32 -t 1 -c 128 --no-mmap (single thread, 128-token context, no mmap to avoid fragmenting the 32-bit address space).
  • Falcon-H1-Tiny uses a hybrid Transformer + Mamba architecture to stay compact at 90M parameters.
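The steps above can be consolidated into one build-and-deploy recipe. This is a sketch, not the author's exact script: it assumes dockcross is already set up on the host, the Pi is reachable as pi@raspberrypi.local (a hypothetical hostname), and the CMake flag names match the llama.cpp revision used in the video (newer revisions rename some LLAMA_* options to GGML_*).

```shell
# Sketch of the cross-compile + deploy flow; paths and hostname are assumptions.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build

# Static build for ARMv6-VFP: no NEON, no OpenMP, no shared libs,
# using the dockcross-provided toolchain file.
cmake .. -DLLAMA_CROSS=ON \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/dockcross/linux-armv6.cmake \
  -DBUILD_SHARED_LIBS=OFF -DLLAMA_NEON=OFF -DLLAMA_OPENMP=OFF
cmake --build . --target llama-cli

# Ship the single static binary to the Pi.
scp bin/llama-cli pi@raspberrypi.local:~/
```

Building statically matters here: a single self-contained binary avoids chasing matching shared-library versions on the Pi's minimal OS image.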

Before / After

  • 2-bit (Q2_K): Generates 1 token every 3 seconds, but the output is incoherent nonsense (e.g., a garbled response to "hello how are you").
  • 4-bit (Q4_0): A coherent greeting response.
  • 8-bit (Q8_0): Correctly answers "capital of Belgium" (Brussels) but errs on "capital of Albania" (says "Tana"; actually Tirana), showing knowledge gaps on obscure topics due to the parameter limit.

Context

The author started from the first-gen Raspberry Pi's constraints: ARMv6 without NEON instructions, 512MB of RAM where mmap'd model files fragment the 32-bit address space, and a single-core 700MHz CPU. The experiment tested whether a 90M-parameter LLM could run locally by quantizing to legacy Q4/Q8 formats (avoiding IQ2 quants, which need modern CPU instructions), cross-compiling llama.cpp, and stripping OS bloat. The result shows that tiny hybrid-architecture models make edge AI possible on vintage hardware, though at ~0.3 tok/s it is far too slow for production.
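The numbers above can be sanity-checked with back-of-envelope shell arithmetic. This is only a sketch: real GGUF files carry metadata and mix tensor types, and the runtime adds KV-cache and activation overhead on top of the raw weights.

```shell
#!/bin/sh
# Rough weight-footprint and throughput math for a 90M-parameter model.
PARAMS=90000000

# 4 bits = 0.5 byte/param; 8 bits = 1 byte/param (ignores GGUF metadata).
Q4_MB=$(( PARAMS / 2 / 1048576 ))
Q8_MB=$(( PARAMS / 1048576 ))
echo "Q4_0 weights: ~${Q4_MB} MiB; Q8_0 weights: ~${Q8_MB} MiB (Pi has 512 MiB)"

# 1 token every 3 seconds -> a 32-token reply takes ~96 s (~0.33 tok/s).
SECS=$(( 32 * 3 ))
echo "32-token reply at 1 tok/3s: ~${SECS} s"
```

Both quantized weight sets (~42 MiB and ~85 MiB) fit comfortably in 512 MiB, which is why the 32-bit address-space fragmentation under mmap, not total RAM, was the practical obstacle.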

Notable Quotes

  • "On a 90 million parameter model the weights are so compressed that the linguistic logic has basically collapsed it's barely coherent."
  • "Now we get a coherent greeting back so that is a success we now have an actual AI model running locally on the Pi."
  • "The 90 million parameter crunch comes with its own cost it might have accurate knowledge about larger more popular countries but lacks knowledge about lesser known countries."

Content References

  • Model repo with legacy Q2/Q4/Q8 GGUF files, used for the inference tests.
  • #tutorial
  • #demo
  • #ai

summary by x-ai/grok-4.1-fast. probably wrong about something. check the source.