Efficient Remote KV Cache Reuse with GPU-native Video Codec
Liang Mi, Weijun Wang, Jinghan Chen, Ting Cao, Haipeng Dai, Yunxin Liu

TL;DR
KVCodec leverages GPU-native video codecs to efficiently compress, transmit, and restore remote KV caches, significantly reducing inference latency in bandwidth-limited scenarios without accuracy loss.
Contribution
The paper introduces KVCodec, a novel system that uses video codecs for lossless, fast, and widely deployable remote KV cache reuse in large language model inference.
Findings
Reduces time-to-first-token (TTFT) by up to 3.51×.
Maintains lossless accuracy while improving transmission efficiency.
Prototyped on diverse GPUs, demonstrating broad applicability.
Abstract
Remote KV cache reuse fetches the KV cache for identical contexts from remote storage, avoiding recomputation and accelerating LLM inference. While it excels in high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the benefits of KV reuse. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVCodec, enables effective KV cache coding with two techniques. The codec-friendly tensor layout compresses the KV cache into a highly compact video format, enabling fast transmission. The efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in a pipelined manner, eliminating resource contention,…
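The pipelined fetch described in the abstract can be illustrated with a minimal sketch. It is not KVCodec's implementation, which this summary does not describe. The function `fetch_pipelined`, its `depth` parameter, and the `receive`, `decode`, and `restore` callables are hypothetical names introduced only for illustration. The sketch shows the general idea of overlapping three stages so that one chunk can be decoded while the next is still in transit. It does not model GPU codec engines or resource contention.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the chunk stream


def _stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox` and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def fetch_pipelined(chunks, receive, decode, restore, depth=2):
    """Run receive -> decode -> restore as three overlapping stages.

    The three callables are hypothetical placeholders for network transfer,
    video decoding, and tensor restoration. `depth` bounds how many chunks
    may wait between two stages. Error handling is omitted for brevity.
    """
    q_in, q_recv, q_dec, q_out = (queue.Queue(maxsize=depth) for _ in range(4))
    stages = [
        (receive, q_in, q_recv),
        (decode, q_recv, q_dec),
        (restore, q_dec, q_out),
    ]
    for args in stages:
        threading.Thread(target=_stage, args=args, daemon=True).start()

    def feed():
        # Feed from a separate thread so the caller can drain q_out concurrently.
        for chunk in chunks:
            q_in.put(chunk)
        q_in.put(_DONE)

    threading.Thread(target=feed, daemon=True).start()

    results = []
    while True:
        item = q_out.get()
        if item is _DONE:
            return results
        results.append(item)  # one worker per stage keeps chunks in order


if __name__ == "__main__":
    restored = fetch_pipelined(
        chunks=["layer0", "layer1", "layer2"],
        receive=lambda name: name.encode(),   # stand-in for a network read
        decode=lambda blob: blob.decode(),    # stand-in for video decoding
        restore=lambda text: text.upper(),    # stand-in for tensor restoration
    )
    print(restored)
```

Bounded queues provide backpressure: a slow decode stage stalls the receiver instead of letting compressed chunks pile up in memory. A real system would presumably map each stage to a different hardware resource (network interface, video decode engine, GPU compute), which is an assumption here rather than something the summary states.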
