VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
Jiayi Yao, Samuel Shen, Kuntai Du, Shaoting Feng, Dongjoo Seo, Rui Zhang, Yuyang Huang, Yuhan Liu, Shan Lu, Junchen Jiang

TL;DR
VeriCache is an inference framework that produces the same output as full-KV-cache decoding while significantly improving throughput. It drafts tokens with a compressed KV cache and verifies them against the full cache, addressing GPU memory and swapping challenges.
Contribution
It introduces VeriCache, a novel inference method that ensures lossless output with high throughput by verifying tokens drafted from a compressed KV cache against the full KV cache, without GPU memory overhead.
Findings
Achieves up to 4× higher throughput than full-KV inference.
Produces identical outputs to full-KV-cache decoding.
Supports various compression methods and integrates with existing decoding techniques.
Abstract
The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding while largely preserving the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system…
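The draft-then-verify idea in the abstract can be illustrated with a toy sketch. This is not VeriCache's implementation: `full_next`, `lossy_next`, `decode_full`, and `decode_draft_verify` are hypothetical names, the "models" are small deterministic functions standing in for greedy decoding with a full and a compressed KV cache, and the system-level concerns the abstract alludes to (GPU memory, swapping) are not modeled. It only shows why accepting the longest matching prefix of a draft, and substituting the full-cache token at the first mismatch, reproduces the full-cache output.

```python
from typing import List, Tuple


def full_next(ctx: List[int]) -> int:
    # Toy stand-in for greedy decoding with the full KV cache (assumption):
    # every fifth position also depends on the oldest token in the context.
    old = ctx[0] if len(ctx) % 5 == 0 else 0
    return (ctx[-1] + ctx[-2] + old) % 11


def lossy_next(ctx: List[int]) -> int:
    # Toy stand-in for a compressed cache that has dropped the oldest token,
    # so it occasionally disagrees with full_next.
    return (ctx[-1] + ctx[-2]) % 11


def decode_full(prompt: List[int], n: int) -> List[int]:
    # Reference: plain greedy decoding, one full-cache step per token.
    out = list(prompt)
    for _ in range(n):
        out.append(full_next(out))
    return out[len(prompt):]


def decode_draft_verify(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    # Draft up to k tokens with the lossy model, then check them against the
    # full model. Returns the generated tokens and the number of accepted drafts.
    out = list(prompt)
    target = len(prompt) + n
    accepted = 0
    while len(out) < target:
        # Drafting phase: cheap, uses only the lossy model.
        ctx = list(out)
        draft = []
        for _ in range(min(k, target - len(out))):
            tok = lossy_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # Verification phase: a real system would score all drafted positions
        # in one batched pass; here the full model is called per position.
        for tok in draft:
            expected = full_next(out)
            out.append(expected)  # equals tok on a match, the correction otherwise
            if expected != tok:
                break  # discard the rest of the draft
            accepted += 1
    return out[len(prompt):], accepted


if __name__ == "__main__":
    prompt = [3, 1, 4, 1, 5]
    tokens, accepted = decode_draft_verify(prompt, n=20)
    print(tokens == decode_full(prompt, 20), accepted)
```

Because every emitted token is the one the full model would choose given the already-verified prefix, the output matches `decode_full` regardless of how often the draft is wrong; the draft quality only affects how many tokens are accepted per verification round.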
