EntmaxKV: Support-Aware Decoding for Entmax Attention

Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

arXiv:2605.21649·cs.LG·May 22, 2026

TL;DR

EntmaxKV is a support-aware sparse decoding method for entmax attention. It reduces KV-cache memory traffic and speeds up decoding in long-context language models (up to 3.36x over full attention at 1M context length) while closely matching full-cache entmax accuracy.

Contribution

The paper presents an entmax-native sparse decoding framework that exploits support recovery and adaptive candidate selection to decode more efficiently than softmax-based sparse methods.

Findings

01. Achieves up to 3.36x speedup over full attention at 1M context length.

02. Drops less probability mass and retains more support tokens than softmax-based methods.

03. Closely matches full-cache entmax performance with significantly reduced KV cache usage.

Abstract

Long-context decoding is increasingly limited by KV-cache memory traffic, since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $\alpha$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages…
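The support-recovery claim in the abstract can be checked numerically with a small sketch. The code below is not the EntmaxKV implementation: it computes $\alpha$-entmax with a plain bisection on the threshold (a generic solver swapped in for illustration), uses random queries, keys, and values, and builds a hypothetical candidate set as "true support plus a few extra tokens". How EntmaxKV actually selects candidates is not shown here. All names (`entmax_bisect`, `candidates`, `extra`) are assumptions made for this example.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector (alpha > 1), via bisection on tau.

    Solves sum_i [(alpha-1) z_i - tau]_+^(1/(alpha-1)) = 1 for tau.
    Generic illustrative solver, not the kernel used by the paper.
    """
    z = (alpha - 1.0) * np.asarray(z, dtype=np.float64)
    # At tau = max(z) - 1 the total mass is >= 1; at tau = max(z) it is 0.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0)))
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # remove the residual bisection error

rng = np.random.default_rng(0)
n, d = 4096, 64                      # toy cache length and head dimension
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
scores = K @ q / np.sqrt(d)

# Full-cache entmax attention: most weights are exactly zero.
p_full = entmax_bisect(scores)
out_full = p_full @ V
support = np.flatnonzero(p_full > 0)

# Hypothetical candidate set: a superset of the support.
extra = rng.choice(n, size=64, replace=False)
candidates = np.union1d(support, extra)

# Entmax restricted to the candidates reproduces the full-cache output.
p_sparse = entmax_bisect(scores[candidates])
out_sparse = p_sparse @ V[candidates]
max_err = np.abs(out_full - out_sparse).max()

# Softmax over the same candidates necessarily discards some mass.
soft = np.exp(scores - scores.max())
soft /= soft.sum()
dropped_softmax_mass = 1.0 - soft[candidates].sum()

print(f"support size: {support.size} of {n}")
print(f"max |full - sparse| (entmax): {max_err:.2e}")
print(f"softmax mass outside candidates: {dropped_softmax_mass:.3f}")
```

The restricted computation matches the full one because every token outside the support contributes zero to the normalization constraint, so the same threshold solves both problems. The softmax comparison illustrates the contrast drawn in the abstract: its weights are strictly positive, so any truncation drops nonzero probability mass.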

Code & Models

Repositories

deep-spin/entmaxkv (GitHub)
