LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference

Jiawei Yi; Ping Gong; Youhui Bai; Zewen Jin; Shengnan Wang; Jiaqi Ruan; Jia He; Jiaan Zhu; Pengcheng Wang; Haibo Wang; Weiguang Wang; Xia Zhu; Cheng Li

arXiv:2511.14510 · cs.LG · March 30, 2026

TL;DR

LiteCache is a GPU-centric KVCache system that exploits the similarity between adjacent queries to make LLM inference more efficient, cutting CPU involvement in cache management and raising throughput by 10.7% to 224.2% on H100 and A40 GPUs.

Contribution

The paper introduces QSAC, a head-level cache reuse algorithm, and LiteCache, a GPU-centric KVCache subsystem that minimizes CPU involvement and makes CPU-GPU data transfer more efficient.

Findings

01. Achieves a 10.7% to 224.2% throughput improvement on H100 and A40 GPUs.
02. Supports sequence lengths beyond 1 million tokens.
03. Maintains accuracy comparable to baseline methods.
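
To give a sense of scale for the million-token finding, the sketch below estimates KVCache size with the standard formula (keys plus values, per layer, per KV head, per token). The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen for illustration, not one taken from the paper.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate KVCache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    The result is linear in seq_len and in batch_size.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape (an assumption, not from the paper).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

one_million = kv_cache_bytes(seq_len=1_000_000, batch_size=1, **shape)
print(f"{one_million / 2**30:.1f} GiB for a single 1M-token sequence")
```

Under these assumptions a single million-token sequence needs roughly 122 GiB of KV states, more than the 80 GB of memory on an H100, which is why systems in this space offload KV states to host memory.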

Abstract

During LLM inference, KVCache memory usage grows linearly with sequence length and batch size and often exceeds GPU capacity. Recent proposals offload KV states to host memory and reduce transfers using top-k attention. However, their CPU-centric management of the on-GPU cache and of CPU-GPU data movement incurs high overhead and fragments the bulk GPU execution that CUDA Graph relies on. To close this gap, we observe that adjacent queries within the same attention head exhibit strong directional similarity and retrieve highly overlapping top-k KV states. This insight enables a simple head-granularity cache algorithm, QSAC, in which each head reuses its previously cached KV states whenever the current query is sufficiently similar to the prior one. QSAC further simplifies cache management primitives and cuts CPU involvement almost entirely. We develop LiteCache, a KVCache subsystem that…
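
The reuse rule described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `HeadCache`, the cosine-similarity test, the threshold value, and the choice to compare against the query that produced the cached selection are all assumptions made for illustration. It only shows the idea that a head skips a fresh top-k search when its query points in nearly the same direction as before.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def topk_indices(query, keys, k):
    """Indices of the k keys with the largest dot-product score, sorted ascending."""
    scores = keys @ query
    k = min(k, scores.shape[0])
    return np.sort(np.argpartition(-scores, k - 1)[:k])


class HeadCache:
    """Per-head reuse state (hypothetical): a reference query and its top-k indices."""

    def __init__(self, k, threshold):
        self.k = k
        self.threshold = threshold  # illustrative value, not from the paper
        self.ref_query = None
        self.indices = None

    def select(self, query, keys):
        """Return (indices, reused) for one decoding step of this head."""
        if self.ref_query is not None and cosine(query, self.ref_query) >= self.threshold:
            # Similar direction: reuse the cached selection, no new search or transfer.
            return self.indices, True
        # Dissimilar (or first) query: recompute top-k and refresh the cached state.
        self.indices = topk_indices(query, keys, self.k)
        self.ref_query = query.copy()
        return self.indices, False


rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 16))   # 64 cached tokens, head dimension 16
q0 = rng.standard_normal(16)
cache = HeadCache(k=8, threshold=0.9)

idx0, reused0 = cache.select(q0, keys)                                  # first step: miss
idx1, reused1 = cache.select(q0 + 1e-3 * rng.standard_normal(16), keys)  # near-duplicate
idx2, reused2 = cache.select(-q0, keys)                                 # opposite direction
print(reused0, reused1, reused2)
```

In a real system the indices would name KV blocks already resident on the GPU, so a hit avoids both the top-k search and the host-to-device copy; here the arrays simply stand in for that state.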

Code & Models

Repositories

https://anonymous.4open.science/r/LiteCache-888D