Efficient Remote KV Cache Reuse with GPU-native Video Codec

Liang Mi; Weijun Wang; Jinghan Chen; Ting Cao; Haipeng Dai; Yunxin Liu

arXiv:2602.09725·cs.DC·May 13, 2026

TL;DR

KVCodec leverages GPU-native video codecs to efficiently compress, transmit, and restore remote KV caches, significantly reducing inference latency in bandwidth-limited scenarios without accuracy loss.

Contribution

The paper introduces KVCodec, a novel system that uses video codecs for lossless, fast, and widely deployable remote KV cache reuse in large language model inference.

Findings

01

Reduces time-to-first-token by up to 3.51 times.

02

Maintains lossless accuracy while improving transmission efficiency.

03

Prototyped on diverse GPUs, demonstrating broad applicability.

Abstract

Remote KV cache reuse fetches the KV cache for an identical context from remote storage, which avoids recomputation and accelerates LLM inference. It excels on high-speed networks, but its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the heavyweight decompression this requires counteracts the benefits of KV reuse. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVCodec, enables effective KV cache coding with two techniques. The codec-friendly tensor layout compresses the KV cache into a highly compact video format, enabling fast transmission. The efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in a pipelined manner, eliminating resource contention,…
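The pipelining idea behind the KV fetcher can be illustrated with a minimal sketch. It is not KVCodec's implementation: the function names (`fetch_pipelined`, `_stage`), the queue depth, and the use of `zlib` as a stand-in for a GPU video codec are all assumptions made for illustration. It shows only how transmission, decoding, and restoration of successive chunks can overlap, so that one chunk is decoded while the next is still in flight.

```python
import queue
import threading
import zlib

_DONE = object()  # sentinel marking the end of the chunk stream


def _stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def fetch_pipelined(chunks, transmit, decode, restore, depth=4):
    """Run transmit -> decode -> restore as overlapping stages.

    Hypothetical helper: each stage runs in its own thread and the stages
    are linked by bounded FIFO queues, so chunk order is preserved.
    No error handling; a failing stage would stall the pipeline.
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(4)]

    def feed():
        for chunk in chunks:
            queues[0].put(chunk)
        queues[0].put(_DONE)

    workers = [threading.Thread(target=feed, daemon=True)]
    for i, fn in enumerate((transmit, decode, restore)):
        workers.append(
            threading.Thread(
                target=_stage, args=(fn, queues[i], queues[i + 1]), daemon=True
            )
        )
    for w in workers:
        w.start()

    results = []
    while True:
        item = queues[3].get()
        if item is _DONE:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


# Stand-in data: zlib plays the role of the codec, and "transmit" is a no-op.
original = [bytes([i]) * 1024 for i in range(8)]
stored = [zlib.compress(c) for c in original]
restored = fetch_pipelined(
    stored,
    transmit=lambda blob: blob,   # assumed network hop, here a pass-through
    decode=zlib.decompress,       # assumed decoder, here zlib instead of video
    restore=bytearray,            # assumed restoration into a writable buffer
)
```

Bounded queues keep a slow stage from being flooded by a fast one. A real system would place each stage on a different resource (network, hardware decoder, GPU compute), which this sketch does not model.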
