CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing
Kuan Lu, Shuhang Lin, Sai Wu, Yichen Yao, Junhan Yang, Huan Li, Wei Chu, Xu Yinghui, Yuan Qi, Gang Chen

TL;DR
CTKVR is a two-stage KV retrieval method for long-context LLMs that balances efficiency and accuracy, delivering 3x and 4x throughput speedups on Llama-3-8B and Yi-9B with less than 1% accuracy degradation.
Contribution
The paper proposes a novel centroid-then-token KV retrieval scheme that enhances long-context LLM inference efficiency by combining lightweight centroid indexing with token-level refinement.
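A minimal sketch of what a centroid-then-token lookup could look like, to make the coarse-then-fine idea concrete. This is not the authors' implementation: the summary does not say how centroids are built or what they index, so the layout below (each centroid carrying a precomputed list of candidate token indices) and the names `centroid_then_token`, `candidates`, and `n_centroids` are assumptions made purely for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def centroid_then_token(query, centroids, candidates, keys, n_centroids, k):
    """Two-stage retrieval sketch (hypothetical index layout).

    centroids:   list of centroid vectors
    candidates:  candidates[i] is the list of token indices tied to centroid i
    keys:        list of cached key vectors, one per token
    Returns the indices of the k best-scoring tokens among the pooled candidates.
    """
    # Stage 1 (coarse): rank centroids by similarity to the query and keep
    # only the n_centroids closest ones.
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: dot(query, centroids[c]),
        reverse=True,
    )
    pool = set()
    for c in ranked[:n_centroids]:
        pool.update(candidates[c])

    # Stage 2 (fine): score only the pooled tokens exactly and keep the top k.
    # Sorting the pool first makes tie-breaking deterministic.
    return sorted(sorted(pool), key=lambda i: dot(query, keys[i]), reverse=True)[:k]


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0, 1, 4], [2, 3]]
selected = centroid_then_token([1.0, 0.05], centroids, candidates, keys,
                               n_centroids=1, k=2)
```

In this toy setup only three of the five keys are scored in the second stage, which is where a scheme of this shape would save memory accesses compared with scoring every cached token.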
Findings
Achieves less than 1% accuracy degradation.
Delivers 3x and 4x throughput speedups on Llama-3-8B and Yi-9B, respectively.
Effective across diverse GPU hardware.
Abstract
Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy:…
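The key observation in the abstract (similar queries share most of their top-k KV entries) can be checked with a simple overlap measure. The sketch below is illustrative only: it uses dot-product scores over toy 2-D vectors, and the helper names `top_k_indices` and `top_k_overlap` are hypothetical, not taken from the paper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(query, keys, k):
    """Set of indices of the k keys with the highest dot-product score."""
    order = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return set(order[:k])


def top_k_overlap(q1, q2, keys, k):
    """Fraction of top-k key indices shared by two queries (1.0 means identical sets)."""
    shared = top_k_indices(q1, keys, k) & top_k_indices(q2, keys, k)
    return len(shared) / k


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
near = top_k_overlap([1.0, 0.1], [1.0, 0.2], keys, k=2)    # two similar queries
far = top_k_overlap([1.0, 0.1], [-1.0, -0.1], keys, k=2)   # two opposite queries
```

Here the two similar queries select the same pair of keys, while the opposite queries share none. Applied to real post-RoPE queries from neighbouring positions, a measure like this is one way to quantify the sharing the abstract describes.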
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Graph Theory and Algorithms
