MiniMax Sparse Attention Cuts Million-Token Compute by 28x Without Quality Loss
MSA delivers 28.4x attention compute reduction and 14.2x prefill speedup at 1M context on a 109B model, matching full GQA quality.
Softmax attention's quadratic scaling has been treated as a fundamental constraint on long-context inference, something to work around with retrieval systems, chunking heuristics, or aggressive context truncation. MiniMax Sparse Attention (MSA) suggests the constraint is softer than assumed. At one million tokens on a 109-billion-parameter model, it matches full grouped-query attention (GQA) quality while computing attention over a small fraction of the token space.
The mechanism is blockwise and group-aware. Rather than attending over every key-value pair, MSA adds a lightweight Index Branch that scores KV blocks and selects a Top-k subset independently for each GQA group. This is the structurally important move: different query groups can retrieve different blocks, so the model retains the ability to attend to distinct regions of a long context depending on what each group needs. The Main Branch then runs exact block-sparse attention only over those selected blocks. Nothing is approximated in the attention computation itself; the sparsity is in block selection, not in the attention math.
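To make the two-branch flow concrete, here is a minimal NumPy sketch of one decoding step. It is not MiniMax's implementation. The function names, the tensor shapes, and especially the block scoring rule (mean-pooled keys scored against the group-averaged query) are assumptions made for illustration; the source only says that the Index Branch scores KV blocks and picks a Top-k subset per GQA group.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    """Full GQA reference for one query position.
    q: (G, Hq, D) with G groups and Hq query heads per group.
    k, v: (G, T, D) with one shared KV head per group.
    """
    d = q.shape[-1]
    scores = np.einsum("ghd,gtd->ght", q, k) / np.sqrt(d)
    return np.einsum("ght,gtd->ghd", softmax(scores), v)

def sparse_attention_sketch(q, k, v, block_size, top_k):
    """Hypothetical per-group block selection followed by exact attention."""
    G, Hq, D = q.shape
    T = k.shape[1]
    assert T % block_size == 0, "sketch assumes T is a multiple of block_size"
    nb = T // block_size
    k_blocks = k.reshape(G, nb, block_size, D)
    v_blocks = v.reshape(G, nb, block_size, D)

    # Index branch (ASSUMED scoring rule): one summary vector per KV block,
    # scored against the mean query of each group.
    summaries = k_blocks.mean(axis=2)                            # (G, nb, D)
    q_group = q.mean(axis=1)                                     # (G, D)
    block_scores = np.einsum("gd,gnd->gn", q_group, summaries)   # (G, nb)

    # Top-k block ids chosen independently for each GQA group.
    selected = np.argsort(-block_scores, axis=1)[:, :top_k]      # (G, top_k)
    selected = np.sort(selected, axis=1)

    # Main branch: exact softmax attention, restricted to the selected blocks.
    out = np.empty_like(q)
    for g in range(G):
        k_sel = k_blocks[g, selected[g]].reshape(-1, D)
        v_sel = v_blocks[g, selected[g]].reshape(-1, D)
        scores = q[g] @ k_sel.T / np.sqrt(D)                     # (Hq, top_k * block_size)
        out[g] = softmax(scores) @ v_sel
    return out, selected
```

When `top_k` equals the number of blocks, the sketch reproduces dense attention exactly, which illustrates the point that the sparsity lives in block selection rather than in the attention math.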
Getting sparsity to translate into wall-clock speedup is where most sparse attention proposals have historically broken down. MSA co-designs the architecture with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to keep tensor cores busy under block-granular memory access. That kernel design is what converts a 28.4x reduction in attention FLOPs into real throughput. On H800 GPUs, the result is a 14.2x prefill speedup and a 7.6x decoding speedup at one million context tokens, with the 109B multimodal model matching GQA on standard benchmarks. For teams running long-context inference at scale, the takeaway is direct: attention at million-token context now costs more than an order of magnitude less than it does under full GQA, and the architecture to do it is open and shipped.
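The source does not spell out what "exp-free" means. One plausible reading, which is an assumption here and not a description of MiniMax's kernel, is that Top-k selection only needs a ranking, and softmax preserves the ranking of its inputs, so the selector can rank raw scores and skip the exponentials. The hypothetical helpers below check that property on random scores.

```python
import numpy as np

def topk_ids(values, k):
    """Return the set of indices of the k largest entries."""
    return set(np.argsort(-values)[:k].tolist())

def softmax_1d(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random block scores standing in for an index branch's output (illustrative only).
rng = np.random.default_rng(0)
block_scores = rng.standard_normal(128)

ids_from_scores = topk_ids(block_scores, 8)             # no exp needed
ids_from_probs = topk_ids(softmax_1d(block_scores), 8)  # exp, then rank
same_selection = ids_from_scores == ids_from_probs
```

Because softmax is strictly increasing in each input, both routes pick the same blocks whenever the scores are distinct, so the exponentials add cost without changing the selection.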
We're thinking: The group-specific retrieval design is more significant than the headline speedup numbers. Many sparse attention schemes apply a single shared sparsity pattern across all attention heads or groups, which forces a tradeoff between coverage and compute. MSA's per-group block selection means the model can, in principle, specialize different GQA groups toward different retrieval behaviors, something closer to how mixture-of-experts routing works at the layer level. The open question is whether that specialization actually emerges during training on the 109B model, and whether it holds at smaller scales. If it does, this is not just an inference optimization but a structural argument that sparse retrieval and full-attention expressivity are not in conflict.
Key takeaways:
- MSA adds a lightweight Index Branch that scores and selects Top-k KV blocks per GQA group, then runs exact block-sparse attention over only those blocks, keeping sparsity in selection rather than in the attention kernel itself.
- On a 109B multimodal model at 1M context, MSA reduces per-token attention compute by 28.4x and delivers 14.2x prefill and 7.6x decoding wall-clock speedups on H800 while matching GQA quality. The caveat is that results are reported on a single architecture and hardware generation.
- Teams building agentic or long-document inference pipelines should evaluate MSA's open kernel against their current attention implementation before investing further in retrieval-based context compression workarounds.
Source: MiniMax Sparse Attention