TL;DR
LiteCache is a GPU-centric KVCache system that exploits the similarity between adjacent queries within an attention head to reuse cached KV states, cutting CPU overhead and improving throughput by 10.7-224.2% on H100 and A40 GPUs.
Contribution
The paper introduces QSAC, a head-level cache reuse algorithm, and LiteCache, a GPU-centric KVCache system that minimizes CPU involvement and makes CPU-GPU data transfer more efficient.
Findings
Achieves 10.7-224.2% throughput improvement on H100 and A40 GPUs.
Supports sequence lengths beyond 1 million tokens.
Maintains accuracy comparable to baseline methods.
Abstract
During LLM inference, KVCache memory usage grows linearly with sequence length and batch size and often exceeds GPU capacity. Recent proposals offload KV states to host memory and reduce transfers using top-k attention. But their CPU-centric management of the on-GPU cache and CPU-GPU data movement incurs high overhead and fragments the bulk GPU execution that CUDA Graph relies on. To close this gap, we observe that adjacent queries within the same attention head exhibit strong directional similarity and retrieve highly overlapping top-k KV states. This insight enables a simple head granularity cache algorithm, QSAC, in which each head reuses its previously cached KV states whenever the current query is sufficiently similar to the prior one. QSAC further simplifies cache management primitives and cuts CPU involvement almost entirely. We develop LiteCache, a KVCache subsystem that…
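The reuse rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HeadCache`, the cosine-similarity test, the threshold value of 0.9, and the choice to compare against the query that last refreshed the cache are all assumptions made for illustration. The abstract only says that a head reuses its cached KV states when the current query is "sufficiently similar to the prior one".

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_indices(query, keys, k):
    """Indices of the k keys with the largest dot product against the query."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


class HeadCache:
    """Hypothetical per-head state: one reference query and its top-k index set."""

    def __init__(self, threshold=0.9):  # threshold value is an assumption
        self.threshold = threshold
        self.ref_query = None
        self.cached_indices = None

    def select(self, query, keys, k):
        """Return (indices, reused). Reuse the cached set if the query is similar enough."""
        if self.ref_query is not None and cosine(query, self.ref_query) >= self.threshold:
            return self.cached_indices, True
        # Miss: recompute top-k and remember this query as the new reference.
        self.ref_query = list(query)
        self.cached_indices = top_k_indices(query, keys, k)
        return self.cached_indices, False


# Toy example: four 2-d keys for a single attention head.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
cache = HeadCache(threshold=0.9)
first = cache.select([1.0, 0.0], keys, k=2)     # cold start, computes top-k
second = cache.select([0.95, 0.05], keys, k=2)  # nearly the same direction, reuses
third = cache.select([0.0, 1.0], keys, k=2)     # orthogonal query, recomputes
```

In this sketch a hit involves no top-k search and no new index set, which is the property that would let the host skip fetching KV states for that head. How LiteCache actually measures similarity and schedules transfers is not specified in the excerpt above.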
