oMLX SSD KV Cache Enables 3x Faster LLMs on M2 Macs
Better Stack · go watch the original →
the gist
oMLX uses a two-tier KV cache to offload inactive context to the SSD, running Qwen 3.6 35B at 47 tokens/s on an M2 MacBook Pro versus LM Studio's 16 tokens/s, while leaving the machine free for multitasking.
The Breakthrough
oMLX implements a two-tier KV cache that keeps the immediate context in unified memory and swaps older parts of the conversation, such as system prompts and tool definitions, out to the SSD. This approach delivers 47 tokens per second with Qwen 3.6 35B 4bit on an M2 MacBook Pro.
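The video doesn't show oMLX's internals, but the idea can be sketched. Below is a minimal illustrative sketch, not oMLX's actual code: recent KV blocks stay in unified memory, and the oldest blocks are spilled to SSD files and reloaded on demand. All class, parameter, and file names here are invented.

```python
# Sketch of a two-tier KV cache: hot blocks in unified memory, cold on SSD.
import os
import mlx.core as mx

class TwoTierKVCache:
    def __init__(self, spill_dir="kv_spill", hot_blocks=8):
        self.hot = {}                 # block_id -> (keys, values) in unified memory
        self.hot_blocks = hot_blocks  # how many blocks to keep resident
        self.spill_dir = spill_dir
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, block_id, keys, values):
        self.hot[block_id] = (keys, values)
        # Evict the oldest block to SSD once the hot tier is full
        # (block ids are assumed to increase monotonically).
        if len(self.hot) > self.hot_blocks:
            oldest = min(self.hot)
            k, v = self.hot.pop(oldest)
            mx.save(self._path(oldest, "k"), k)
            mx.save(self._path(oldest, "v"), v)

    def get(self, block_id):
        if block_id in self.hot:
            return self.hot[block_id]
        # Cold hit: reload the block from disk into unified memory.
        k = mx.load(self._path(block_id, "k"))
        v = mx.load(self._path(block_id, "v"))
        self.hot[block_id] = (k, v)
        return k, v

    def _path(self, block_id, kind):
        return os.path.join(self.spill_dir, f"block{block_id}_{kind}.npy")
```

A real implementation would batch spills and overlap disk I/O with decoding; the point is that only the hot window competes with the OS for RAM.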
What Actually Worked
Apple's MLX framework underpins oMLX: arrays live in unified memory, so the CPU can read GPU results without copying data, and lazy evaluation lets MLX optimize the computation graph before anything runs.
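A minimal MLX snippet illustrates both properties; this mirrors the pattern from MLX's own documentation rather than anything oMLX-specific.

```python
import mlx.core as mx

a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# Arrays live in unified memory: the GPU and CPU operate on the same
# buffers, so no device-to-host transfer is needed between these steps.
c = mx.matmul(a, b, stream=mx.gpu)  # scheduled on the GPU
d = mx.sum(c, stream=mx.cpu)        # CPU consumes the result, zero-copy

mx.eval(d)  # computation is lazy: nothing runs until evaluation is forced
print(d)
```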
Users launch the oMLX server, specify a location, enter an API key, and access a dashboard to load models like Qwen 3.6 35B 4bit. The dashboard provides code snippets for agent harnesses, such as this codex CLI command:
```
codex --model qwen2.5-coder-32b-instruct-q4_K_M --api-base http://localhost:8080/v1
```
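Since the server exposes an OpenAI-compatible endpoint at that base URL, any standard client should work too. Here is a sketch using the official openai Python package; the API key placeholder stands in for whatever key you set at startup, and is an assumption, not oMLX documentation.

```python
from openai import OpenAI

# Point a standard OpenAI client at the local oMLX server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="qwen2.5-coder-32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp.choices[0].message.content)
```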
For the coding task, users prompt codex CLI to build a movie search web app that lets them wishlist and rate films, backed by a MovieDB API key. Real-time metrics show tokens generated, tokens cached, cache efficiency, and tokens per second.
Persistent SSD caching survives /clear commands. After a clear forced by the 32K context limit, a follow-up prompt like "continue where we left off" reloads the saved state from disk, so the model picks up real context instead of hallucinating it.
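One plausible way to get this behavior, sketched here under assumptions rather than taken from oMLX internals, is to key persisted KV blocks by a hash of the token prefix, so a later request sharing that prefix reloads the blocks instead of re-running prefill:

```python
import hashlib
import os
import mlx.core as mx

CACHE_DIR = "kv_cache"  # hypothetical on-disk cache location

def prefix_key(prompt_tokens):
    # Stable key for a token prefix; identical prefixes map to the same files.
    return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()[:16]

def save_prefix(prompt_tokens, keys, values):
    os.makedirs(CACHE_DIR, exist_ok=True)
    stem = os.path.join(CACHE_DIR, prefix_key(prompt_tokens))
    mx.save(stem + "_k.npy", keys)
    mx.save(stem + "_v.npy", values)

def load_prefix(prompt_tokens):
    stem = os.path.join(CACHE_DIR, prefix_key(prompt_tokens))
    if not os.path.exists(stem + "_k.npy"):
        return None  # cache miss: prefill must run from scratch
    return mx.load(stem + "_k.npy"), mx.load(stem + "_v.npy")
```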
Before / After
oMLX completed the movie app task in 20 minutes at 47 tokens per second with 89% cache efficiency (1.78 million tokens generated, roughly 1.59 million of them cached). LM Studio took 35 minutes at 16 tokens per second. During inference, oMLX left the machine responsive enough for web browsing and video playback, while LM Studio exhausted RAM and caused monitor lag. oMLX hit occasional 400 errors from context limits but recovered via its cache; LM Studio produced no errors.
Context
Unlike traditional PCs, which keep separate CPU and GPU memory pools and shuttle data across the PCIe bus, Apple Silicon's unified memory eliminates CPU-GPU copying. However, the KV cache still fills RAM with the full conversation history, which limits high-parameter models on standard Macs like the M2 MacBook Pro. The video tests oMLX against LM Studio on a real coding task to show that local agents are viable without 128GB of RAM.
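A back-of-the-envelope calculation shows why. Using illustrative dimensions for a ~35B grouped-query model (assumed for the example, not Qwen's published specs), the KV cache alone at the 32K context mentioned above runs to several gigabytes:

```python
# Rough KV cache size estimate; every dimension below is an assumption.
n_layers   = 64      # transformer layers (assumed)
n_kv_heads = 8       # grouped-query KV heads (assumed)
head_dim   = 128     # per-head dimension (assumed)
seq_len    = 32_768  # the 32K context limit from the video
bytes_el   = 2       # fp16 per element

# Factor of 2 covers both keys and values.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_el
print(f"{kv_bytes / 2**30:.1f} GiB")  # 8.0 GiB at full context
```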
Notable Quotes
"Quen 3.6 with the help of OMLX was able to get through the task by churning out 1.78 million tokens and roughly 1.59 million of them were cached. So we ended up with an 89% cash efficiency."
"The average token per second speed on LM Studio was 16 tokens per second and on OMLX it was roughly 47."