CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing

Kuan Lu; Shuhang Lin; Sai Wu; Yichen Yao; Junhan Yang; Huan Li; Wei Chu; Xu Yinghui; Yuan Qi; Gang Chen

arXiv:2512.15550·cs.CL·December 18, 2025

TL;DR

CTKVR is a two-stage (centroid-then-token) KV cache retrieval method for long-context LLM inference, reporting 3x and 4x throughput speedups on Llama-3-8B and Yi-9B with less than 1% accuracy degradation.

Contribution

The paper proposes a novel centroid-then-token KV retrieval scheme that enhances long-context LLM inference efficiency by combining lightweight centroid indexing with token-level refinement.

Findings

01 Achieves less than 1% accuracy degradation.

02 Provides 3x and 4x throughput speedups on Llama-3-8B and Yi-9B models.

03 Effective across diverse GPU hardware.

Abstract

Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy:…
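
As a rough illustration of the centroid-then-token idea, the sketch below first matches a query against a few centroid vectors, then re-ranks only the keys attached to the best-matching centroids. This is a generic two-stage retrieval sketch under stated assumptions, not the paper's implementation: the abstract shown here is truncated before the stages are described, so the names `build_centroid_index` and `retrieve_top_k`, the parameters `per_centroid`, `n_probe`, and `k`, and the choice of dot-product scoring are all hypothetical.

```python
import heapq
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the (assumed) similarity score."""
    return sum(x * y for x, y in zip(a, b))


def build_centroid_index(
    keys: List[Vector], centroids: List[Vector], per_centroid: int
) -> List[List[int]]:
    """For each centroid, precompute the indices of its highest-scoring keys.

    Assumption: centroids are representative vectors (for example, a few
    earlier query vectors); how they are chosen is outside this sketch.
    """
    index = []
    for c in centroids:
        top = heapq.nlargest(
            per_centroid, range(len(keys)), key=lambda i: dot(c, keys[i])
        )
        index.append(top)
    return index


def retrieve_top_k(
    query: Vector,
    keys: List[Vector],
    centroids: List[Vector],
    index: List[List[int]],
    n_probe: int,
    k: int,
) -> List[int]:
    """Two-stage lookup: coarse centroid match, then token-level re-ranking."""
    # Stage 1 (centroid): find the n_probe centroids closest to the query
    # and merge their precomputed candidate key indices.
    best = heapq.nlargest(
        n_probe, range(len(centroids)), key=lambda j: dot(query, centroids[j])
    )
    candidates = sorted({i for j in best for i in index[j]})
    # Stage 2 (token): score only the candidates exactly and keep the top k.
    return heapq.nlargest(k, candidates, key=lambda i: dot(query, keys[i]))
```

The point of the split is that stage 1 touches only the centroids, and stage 2 touches only a small candidate set, so the full key cache is never scanned per query. Whether the candidates cover the true top-k depends on how similar the query is to its nearest centroid, which is where the abstract's observation about positionally adjacent queries would matter.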

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Graph Theory and Algorithms