CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Jiwon Song; Dongwon Jo; Beomseok Kang; Jae-Joon Kim

arXiv:2605.16839 · cs.CL · May 19, 2026


TL;DR

This paper introduces CompactAttention, a chunked-prefill attention mechanism for long-context large language model serving. It reports up to 2.72× speedup at 128K context length while keeping accuracy close to dense attention on the RULER benchmark.

Contribution

It proposes Block-Union KV Selection, which selects KV entries at block granularity so that the selected blocks can be read in place from the KV cache, avoiding the explicit KV-copy overhead of token-level selection during chunked prefill.

Findings

01. Achieves up to 2.72× speedup at 128K context length.

02. Maintains accuracy close to dense attention on the RULER benchmark.

03. Preserves all selected KV blocks without explicit KV compaction.

Abstract

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D…
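The abstract is cut off before the method is described, so the following is only a minimal sketch of what a block-level "union" selection could look like in chunked prefill, not the authors' algorithm. Everything in it is an assumption made for illustration: the function name `block_union_select`, mean pooling as the block summary, dot-product scoring, and the parameters `kv_block`, `q_block`, and `top_k`. The idea shown is that each group of queries in the current chunk picks its top-scoring KV blocks, and the chunk then attends over the union of those blocks, which stay where they are in the cache and are referenced by index.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Average a non-empty list of equal-length vectors, column by column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def split_blocks(rows: Sequence[Vector], size: int) -> List[Sequence[Vector]]:
    """Split rows into consecutive blocks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def block_union_select(
    chunk_queries: Sequence[Vector],
    cached_keys: Sequence[Vector],
    kv_block: int = 2,
    q_block: int = 2,
    top_k: int = 1,
) -> List[int]:
    """Hypothetical block-union selection (illustrative, not the paper's method).

    Each query block scores every KV block by the dot product of mean-pooled
    summaries and keeps its top_k blocks. The result is the sorted union of
    the kept KV block indices across all query blocks in the chunk.
    """
    key_summaries = [mean_vector(b) for b in split_blocks(cached_keys, kv_block)]
    selected = set()
    for q_rows in split_blocks(chunk_queries, q_block):
        q_summary = mean_vector(q_rows)
        scores = [dot(q_summary, k) for k in key_summaries]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        selected.update(ranked[:top_k])
    # Only block indices are returned; no key or value vectors are copied.
    return sorted(selected)


# Toy cache of 6 keys (3 KV blocks of 2) and a chunk of 4 queries (2 query blocks).
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(block_union_select(queries, keys))  # [0, 1]
```

In this toy case the first query block prefers KV block 0 and the second prefers KV block 1, so the union keeps both while block 2 is skipped. Returning indices rather than gathered tensors mirrors the in-place access described in the summary above; how the actual kernel consumes those indices is not stated in the visible text.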


Code & Models

Repositories

jiwonsong-dev/CompactAttention (GitHub)
