TL;DR
This paper introduces CompactAttention, a chunked-prefill attention mechanism that speeds up long-context large language model serving by up to 2.72× at 128K context length while keeping accuracy close to dense attention on the RULER benchmark.
Contribution
The paper proposes Block-Union KV Selection, which enables in-place access to the selected KV blocks and faster attention during chunked prefill.
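The summary here does not spell out how Block-Union KV Selection works, so the following is only an illustrative sketch of one plausible reading of the name: each query in a chunk picks its top-scoring KV blocks, and the chunk then attends over the union of those picks, addressed as ranges into the existing cache rather than copied out. The scoring rule (best dot product in the block), `top_k`, and every function name are assumptions made for illustration, not the paper's procedure.

```python
def block_scores(q, keys, block_size):
    """Score each KV block by its best query-key dot product (assumed rule)."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        scores.append(max(sum(a * b for a, b in zip(q, k)) for k in block))
    return scores


def block_union(chunk_queries, keys, block_size, top_k):
    """Union, over all queries in the chunk, of each query's top_k blocks."""
    selected = set()
    for q in chunk_queries:
        scores = block_scores(q, keys, block_size)
        ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
        selected.update(ranked[:top_k])
    return sorted(selected)


def token_ranges(block_ids, block_size, num_keys):
    """Map block ids to (start, end) ranges that index the cache directly."""
    return [(b * block_size, min((b + 1) * block_size, num_keys)) for b in block_ids]


# Toy cache of 8 keys (4 blocks of 2) and a chunk of 2 queries (hypothetical data).
keys = [
    [1.0, 0.0], [0.9, 0.0],    # block 0: aligned with the first query
    [0.0, 0.0], [0.1, 0.1],    # block 1
    [0.0, 1.0], [0.0, 0.8],    # block 2: aligned with the second query
    [-1.0, -1.0], [0.0, 0.0],  # block 3
]
chunk_queries = [[1.0, 0.0], [0.0, 1.0]]

blocks = block_union(chunk_queries, keys, block_size=2, top_k=1)
ranges = token_ranges(blocks, block_size=2, num_keys=len(keys))
```

In this toy case each query prefers a different block, so the union keeps both. A selection driven by only a subsample of the queries could drop one of them, which is the kind of miss the abstract attributes to query-subsampled selection.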
Findings
Achieves up to 2.72× speedup at 128K context length.
Maintains accuracy close to dense attention on the RULER benchmark.
Efficiently preserves all selected KV blocks without explicit KV compaction.
Abstract
Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D…
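For readers unfamiliar with the serving regime the abstract refers to, here is a minimal sketch of dense chunked prefill, not of CompactAttention itself. It assumes a single attention head, plain Python lists, and hypothetical helper names. The prompt is processed in fixed-size chunks, and each chunk's queries attend causally to the KV cache accumulated so far, which gives the same outputs as one-shot causal prefill.

```python
import math
import random


def toy_vectors(n, dim, seed):
    """Deterministic toy data: n vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


def attend(q, keys, values):
    """Dense single-head softmax attention of one query over keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def one_shot_prefill(Q, K, V):
    """Causal attention over the whole prompt in a single pass."""
    return [attend(Q[i], K[:i + 1], V[:i + 1]) for i in range(len(Q))]


def chunked_prefill(Q, K, V, chunk_size):
    """Process the prompt chunk by chunk, growing a KV cache as we go."""
    k_cache, v_cache, out = [], [], []
    for start in range(0, len(Q), chunk_size):
        end = min(start + chunk_size, len(Q))
        # Append this chunk's keys and values to the accumulated cache.
        k_cache.extend(K[start:end])
        v_cache.extend(V[start:end])
        # Each query sees the earlier chunks plus the causal prefix of its own chunk.
        for i in range(start, end):
            out.append(attend(Q[i], k_cache[:i + 1], v_cache[:i + 1]))
    return out


Q = toy_vectors(10, 4, seed=0)
K = toy_vectors(10, 4, seed=1)
V = toy_vectors(10, 4, seed=2)
reference = one_shot_prefill(Q, K, V)
chunked = chunked_prefill(Q, K, V, chunk_size=4)
```

Every chunk scans the entire accumulated cache, so the cost of each step grows with context length while the number of queries per step is capped by the chunk size. That is the setting in which the abstract says block-sparse kernels and repeated pattern search lose efficiency.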