CASK: Core-Aware Selective KV Compression for Reasoning Traces
Buseong Kim, Heejun Gwon

TL;DR
CASK is a core-aware selective KV compression method for long-form reasoning in large language models. It keeps a protected core of the reasoning trace intact and consolidates only the redundant scratch, and is reported to outperform eviction-centered approaches in fidelity and efficiency.
Contribution
The paper proposes a core-aware selective KV compression framework that improves reasoning trace fidelity by combining core preservation with targeted scratch consolidation.
Findings
CASK achieves higher full-KV continuation fidelity than TriAttention at matched budgets.
CASK effectively handles prompt-heavy regimes with a two-stage prefix eviction and decode-stage consolidation.
On AIME24 and AIME25, the reported experiments show CASK preserving reasoning fidelity better than the compared baseline.
Abstract
In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem. CASK partitions the decode-time reasoning trace into a protected core that anchors answer formation and intermediate state, and mergeable scratch with high redundancy. The core is preserved, while selective consolidation is applied only to the scratch. To address prompt-heavy regimes…
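The core/scratch split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes the core mask is already given (the excerpt does not say how the core is identified) and it uses plain averaging of adjacent scratch entries as a stand-in merge operator. The names `consolidate_scratch`, `mean_vec`, and `group_size` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vec(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def consolidate_scratch(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    is_core: Sequence[bool],
    group_size: int = 2,
) -> Tuple[List[Vector], List[Vector], List[bool]]:
    """Keep core entries untouched; merge each run of consecutive
    scratch entries in chunks of `group_size` by averaging.

    Averaging is an illustrative assumption, not the paper's operator.
    """
    out_keys: List[Vector] = []
    out_values: List[Vector] = []
    out_core: List[bool] = []
    run: List[int] = []  # indices of the current scratch run

    def flush() -> None:
        # Merge the pending scratch run chunk by chunk.
        for start in range(0, len(run), group_size):
            chunk = run[start:start + group_size]
            out_keys.append(mean_vec([keys[i] for i in chunk]))
            out_values.append(mean_vec([values[i] for i in chunk]))
            out_core.append(False)
        run.clear()

    for i, core in enumerate(is_core):
        if core:
            flush()  # close any scratch run before the core entry
            out_keys.append(list(keys[i]))
            out_values.append(list(values[i]))
            out_core.append(True)
        else:
            run.append(i)
    flush()
    return out_keys, out_values, out_core


# Toy cache: 5 entries, the first and last marked as core.
toy_keys = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [4.0, 4.0], [9.0, 9.0]]
toy_values = [[1.0], [2.0], [4.0], [6.0], [8.0]]
toy_core = [True, False, False, False, True]
new_keys, new_values, new_core = consolidate_scratch(toy_keys, toy_values, toy_core)
```

In this toy run the three scratch entries collapse into two merged entries while both core entries pass through unchanged, so the cache shrinks from five entries to four without touching the protected positions.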
