SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang

TL;DR
SABlock introduces a semantic-aware cache eviction method with adaptive block sizing that significantly reduces memory usage and improves decoding speed for long-context LLM inference by intelligently segmenting and compressing KV caches.
Contribution
The paper proposes a semantic segmentation and adaptive block sizing framework for KV cache eviction that improves memory efficiency while preserving semantic integrity.
Findings
Achieves 99.9% retrieval accuracy with only 96 KV entries.
Reduces peak memory by 46.28% under a fixed budget.
Improves decoding speed by up to 9.5x at 128K context length.
Abstract
The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under…
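The three stages named in the abstract (segmentation, segment-guided scoring, budget-driven block-size search) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify any of these rules, so the punctuation-based segmentation, the score-blending weight `alpha`, the `min_retained` threshold, the candidate block sizes, and the proportional per-segment budget split are all hypothetical choices made only to show the shape of the pipeline. Token scores are assumed to be given (for example, derived from attention weights).

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of token indices


def segment(tokens: Sequence[str],
            delimiters=frozenset({".", "!", "?", ";", ","})) -> List[Span]:
    """Split token indices into spans ending at delimiter tokens (assumed rule)."""
    spans, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in delimiters:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        spans.append((start, len(tokens)))
    return spans


def refine_scores(scores: Sequence[float], spans: List[Span],
                  alpha: float = 0.5) -> List[float]:
    """Blend each token score with its segment mean (hypothetical refinement)."""
    out = list(scores)
    for s, e in spans:
        mean = sum(scores[s:e]) / (e - s)
        for i in range(s, e):
            out[i] = (1 - alpha) * scores[i] + alpha * mean
    return out


def select_blocks(scores: Sequence[float], span: Span,
                  budget: int, block_size: int) -> List[int]:
    """Greedily keep the highest-scoring fixed-size blocks that fit the budget."""
    s, e = span
    blocks = [(i, min(i + block_size, e)) for i in range(s, e, block_size)]
    blocks.sort(key=lambda b: sum(scores[b[0]:b[1]]), reverse=True)
    kept, used = [], 0
    for b0, b1 in blocks:
        if used + (b1 - b0) <= budget:
            kept.extend(range(b0, b1))
            used += b1 - b0
    return sorted(kept)


def choose_block_size(scores: Sequence[float], span: Span, budget: int,
                      candidates=(8, 4, 2, 1),
                      min_retained: float = 0.9) -> Tuple[int, List[int]]:
    """Pick the largest candidate block size that retains at least
    `min_retained` of the score mass achievable by token-level selection."""
    s, e = span
    best = sum(sorted(scores[s:e], reverse=True)[:budget])
    for size in sorted(candidates, reverse=True):
        kept = select_blocks(scores, span, budget, size)
        if best == 0 or sum(scores[i] for i in kept) >= min_retained * best:
            return size, kept
    return 1, select_blocks(scores, span, budget, 1)


def evict(tokens: Sequence[str], scores: Sequence[float],
          total_budget: int) -> List[int]:
    """Return the indices of KV entries to keep. The per-segment budget is
    split proportionally to score mass, so the total is only approximate."""
    spans = segment(tokens)
    refined = refine_scores(scores, spans)
    total = sum(refined) or 1.0
    kept: List[int] = []
    for s, e in spans:
        share = total_budget * sum(refined[s:e]) / total
        seg_budget = min(e - s, max(1, round(share)))
        _, idx = choose_block_size(refined, (s, e), seg_budget)
        kept.extend(idx)
    return sorted(kept)
```

In this sketch, a segment whose important tokens are contiguous ends up with a larger block size, while a segment whose important tokens are scattered falls back to smaller blocks, which is the intuition behind adapting the block size per segment.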
