CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis

TL;DR
CHESS is a novel system that improves long-context LLM inference by dynamically managing KV cache with a context-aware hierarchical approach, achieving high throughput and quality with minimal cache usage.
Contribution
It introduces a combined algorithm-system design for KV-cache management, enabling efficient, high-quality long-context inference with reduced cache and latency.
Findings
Surpasses full KV quality with only 1% cache
Achieves up to 4.56× throughput increase
Outperforms existing baselines in stability and speed
Abstract
Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose \textbf{CHESS}, an \textit{algorithm-system co-design} KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System-wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only \textbf{1\%} of the KV…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Quality and Management · Advanced Data Storage Technologies · Natural Language Processing Techniques
