TL;DR
KV Packet introduces a recomputation-free caching method for LLMs that uses immutable document packets and soft-token adapters, reducing latency and FLOPs while maintaining accuracy.
Contribution
It presents a novel cache reuse framework that eliminates recomputation by treating cached documents as immutable packets with trainable adapters.
Findings
Incurs near-zero FLOPs for cache reuse, whereas selective-recomputation methods still incur non-negligible overhead.
Reduces Time-to-First-Token (TTFT) latency.
Maintains F1 scores comparable to full recomputation baselines.
Abstract
Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero…
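The recomputation-free reuse described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the `KVPacket` container, the `head`/`tail` adapter fields, the placement of adapter entries before and after the document entries, and the single-head attention are all assumptions made for illustration. The abstract only says that packets are "wrapped" in soft-token adapters. The point shown is that cached document entries are concatenated as-is, with no per-token recomputation at reuse time.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class KVPacket:
    """Hypothetical container: frozen document KV entries plus adapter KV entries."""
    doc_keys: Tuple[Vector, ...]
    doc_values: Tuple[Vector, ...]
    head_keys: Tuple[Vector, ...] = ()    # stand-in for learned soft-token keys
    head_values: Tuple[Vector, ...] = ()
    tail_keys: Tuple[Vector, ...] = ()
    tail_values: Tuple[Vector, ...] = ()


def assemble_cache(packets: Sequence[KVPacket]) -> Tuple[List[Vector], List[Vector]]:
    """Concatenate packets in order; document entries are reused unchanged."""
    keys: List[Vector] = []
    values: List[Vector] = []
    for p in packets:
        keys.extend(p.head_keys + p.doc_keys + p.tail_keys)
        values.extend(p.head_values + p.doc_values + p.tail_values)
    return keys, values


def attend(query: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """Single-head scaled dot-product attention over the assembled cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Two toy packets cached independently, then reused together in a new context.
packet_a = KVPacket(doc_keys=((0.0, 0.0),), doc_values=((1.0, 0.0),))
packet_b = KVPacket(doc_keys=((0.0, 0.0),), doc_values=((0.0, 1.0),))
cache_keys, cache_values = assemble_cache([packet_a, packet_b])
output = attend((1.0, 2.0), cache_keys, cache_values)
```

In this sketch the only per-request work is the concatenation and the attention call for the new query. Under these assumptions, anything that would compensate for the missing cross-document context would live in the adapter entries, which are trained offline.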
