TL;DR
EntmaxKV introduces a support-aware sparse decoding method for entmax attention, significantly reducing memory traffic and increasing decoding speed in long-context language models while maintaining accuracy.
Contribution
EntmaxKV is an entmax-native sparse decoding framework that exploits support recovery and adaptive candidate selection to improve efficiency over softmax-based methods.
Findings
Achieves up to 3.36x speedup over full attention at 1M context length.
Drops less probability mass and retains more support tokens than softmax-based methods.
Closely matches full-cache entmax performance with significantly reduced KV cache usage.
Abstract
Long-context decoding is increasingly limited by KV-cache memory traffic, since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, α-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages…
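The support-recovery claim in the abstract can be checked on a toy example. The sketch below is not the EntmaxKV method: it uses sparsemax (the α = 2 special case of α-entmax) as a stand-in because it has a simple closed-form algorithm, and the names `sparsemax`, `softmax`, and `restricted_weights` are hypothetical helpers chosen for illustration. It shows that restricting attention to a candidate set that contains the support reproduces the full-cache weights, while the same truncation under softmax always drops some probability mass.

```python
import math

def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projection onto the simplex."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score is in the support while this inequality holds.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    # Scores at or below the threshold tau receive exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def restricted_weights(scores, candidates, normalizer):
    """Attention weights computed only over `candidates`; zero elsewhere."""
    sub = normalizer([scores[i] for i in candidates])
    out = [0.0] * len(scores)
    for i, w in zip(candidates, sub):
        out[i] = w
    return out

# Toy attention scores for one query against six cached keys (made-up values).
scores = [2.0, 1.5, 0.1, -1.0, 1.8, -0.5]

full = sparsemax(scores)
support = {i for i, w in enumerate(full) if w > 0.0}

# Candidate set that contains the support plus one extra token.
candidates = [0, 1, 4, 2]
sparse = restricted_weights(scores, candidates, sparsemax)
exact = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(full, sparse))

# Under softmax, the tokens outside the candidate set carry nonzero mass.
dense = softmax(scores)
dropped_mass = sum(w for i, w in enumerate(dense) if i not in candidates)

print("support:", sorted(support))
print("sparse decoding exact:", exact)
print("softmax mass dropped:", dropped_mass)
```

If the candidate set misses a support token (for example `[0, 1, 2]` here), the restricted weights no longer match the full-cache result, which is why candidate selection matters in this setting.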
