DeepSeek V4 Slashes Inference via DSA, CSA, HCA
Caleb Writes Code · go watch the original →
the gist
DeepSeek V4 cuts the KV cache to 10% and attention FLOPs to 27% of prior models by extending DSA with CSA (4x token compression plus top-1000 selection) and HCA (128x compression with plain attention), interleaving the two for long-context efficiency.
The Breakthrough
DeepSeek V4, a 1.6-trillion-parameter model trained on 32 trillion tokens, cuts inference costs through DeepSeek Sparse Attention (DSA), Compressed Sparse Attention (CSA), and Heavily Compressed Attention (HCA), reducing the KV cache to 10% and attention FLOPs to 27% of prior versions.
What Actually Worked
- DSA uses a separately trained, lower-precision lightning indexer to rank tokens by importance and keep only the top-K, discarding the rest to cut attention compute and KV-cache memory.
- CSA first compresses the context 4x by grouping consecutive tokens into single entries, then applies DSA to select the top-1000 entries from the compressed set.
- HCA compresses every token 128x and runs plain (dense) attention over the result, preserving a coarse global view of the document for long-range dependencies.
- The model interleaves CSA and HCA layers to balance short-range detail (sparse, lightly compressed) against long-range context (heavily compressed, dense).
- DeepSeek incorporates prior innovations like manifold constrained hyperconnections (MCH) for expressive residual networks, compounding efficiency gains across versions.
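The CSA/HCA split above can be sketched in a few lines. This is not DeepSeek's implementation — the mean-pooling compressor, the dot-product stand-in for the lightning indexer, and all sizes are illustrative assumptions — but it shows the shape of the idea: compress, score cheaply, keep only the top entries, and interleave a sparse path with a heavily compressed dense path.

```python
# Toy sketch of the CSA / HCA paths; compressor and indexer are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress(tokens, factor):
    """Group consecutive tokens into single entries (mean pooling here)."""
    n, d = tokens.shape
    n_trim = (n // factor) * factor
    return tokens[:n_trim].reshape(-1, factor, d).mean(axis=1)

def csa_layer(q, tokens, top_k=1000, factor=4):
    """CSA: 4x-compress the context, score it with a cheap indexer,
    keep only the top-k entries, then attend over just those."""
    comp = compress(tokens, factor)
    scores = comp @ q                      # stand-in for the lightning indexer
    keep = np.argsort(scores)[-min(top_k, len(comp)):]
    selected = comp[keep]
    return softmax(selected @ q) @ selected

def hca_layer(q, tokens, factor=128):
    """HCA: 128x-compress every token, then plain (dense) attention,
    keeping a coarse view of the whole document."""
    comp = compress(tokens, factor)
    return softmax(comp @ q) @ comp

rng = np.random.default_rng(0)
context = rng.standard_normal((8192, 64))  # 8K toy context, head dim 64
out = rng.standard_normal(64)              # toy query vector

for layer in (csa_layer, hca_layer, csa_layer, hca_layer):  # interleaved
    out = layer(out, context)
print(out.shape)  # (64,)
```

The interleaving loop mirrors the alternation described above: each CSA pass attends sharply to a few selected entries, each HCA pass re-grounds the query against the compressed whole document.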
Before / After
- V3.1: $4.80 per million input tokens and $16.50 per million output tokens at 1M context (linear extrapolation from 128K pricing).
- V3.2: $1.15 input / $1.25 output.
- V4: KV cache cut to 10% and FLOPs to 27% of V3.2. Running V4 Pro 24/7 for a month costs $235, comparable to Anthropic's $200/month rate-limited subscription.
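A rough way to see where the savings come from is to count how many KV entries each attention variant touches at 1M context, using the compression factors stated earlier (4x plus top-1000 for CSA, 128x for HCA). This is back-of-envelope only; V4's real per-layer accounting is not public.

```python
# Back-of-envelope KV-entry counts at 1M context (illustrative only).
context = 1_000_000

plain_entries = context            # dense attention touches every token
csa_pool = context // 4            # 4x compression -> 250,000 entries
csa_entries = min(1000, csa_pool)  # indexer keeps only the top-1000
hca_entries = context // 128       # 128x compression, dense over the rest

print(plain_entries, csa_pool, csa_entries, hca_entries)
# 1000000 250000 1000 7812
```

Even the heavier HCA path attends over fewer than 8K entries at 1M context, which is why per-token cost barely grows with context length.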
Context
DeepSeek faces compute constraints in China, prioritizing token efficiency over raw intelligence where US closed models like Claude lead. Open models like DeepSeek V4 close the gap via architecture tweaks, enabling 1M context windows affordably versus early GPT limits of 4K tokens. This reflects open-source compounding: V4 builds on V3's DSA, adding CSA/HCA for inference dominance amid GPU shortages and provider price wars.
Notable Quotes
"DSA uses a lightning indexer which is aimed directly at solving that very inefficient scaling of attention by prioritizing only top-K tokens and throwing out the rest."

"By interleaving HCA and CSA back and forth, the model learns the intuition of how to keep track of long-term dependencies and short-term dependencies as it moves through the network."
Content References
No specific external papers, books, or datasets named beyond DeepSeek's own V4 release notes and prior models.