SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
Yipin Guo, Siddharth Joshi

TL;DR
SplitZip is a GPU-optimized lossless compression method for KV caches in disaggregated LLM serving. It speeds up KV cache transfer by up to 1.32x and overall serving throughput by up to 1.30x.
Contribution
The paper introduces SplitZip, a novel GPU-friendly lossless compressor that exploits exponent redundancy in KV tensors for fast, efficient compression and decompression.
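To make "exponent redundancy" concrete, the sketch below splits BF16 values into an 8-bit exponent field and an 8-bit sign-plus-mantissa field, then measures the Shannon entropy of the exponent field. It is an illustration only, not SplitZip's encoder: the summary does not describe the actual encoding, the data is synthetic Gaussian noise standing in for KV tensors, and the helper names (`float_to_bf16_bits`, `split_bf16`, `merge_bf16`, `entropy_bits`) are hypothetical.

```python
import math
import random
import struct
from collections import Counter


def float_to_bf16_bits(x: float) -> int:
    """Return the BF16 bit pattern of x by truncating its float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign (1 bit), exponent (8 bits), top 7 mantissa bits


def split_bf16(bits16: int) -> tuple[int, int]:
    """Split a BF16 pattern into (exponent byte, sign+mantissa byte)."""
    exponent = (bits16 >> 7) & 0xFF
    sign_mantissa = ((bits16 >> 15) << 7) | (bits16 & 0x7F)
    return exponent, sign_mantissa


def merge_bf16(exponent: int, sign_mantissa: int) -> int:
    """Inverse of split_bf16: rebuild the original 16-bit pattern exactly."""
    sign = sign_mantissa >> 7
    mantissa = sign_mantissa & 0x7F
    return (sign << 15) | (exponent << 7) | mantissa


def entropy_bits(symbols) -> float:
    """Shannon entropy of a symbol stream, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Assumption: synthetic Gaussian values stand in for real KV cache entries.
random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(10000)]
bf16 = [float_to_bf16_bits(v) for v in values]
pairs = [split_bf16(b) for b in bf16]

exp_entropy = entropy_bits(e for e, _ in pairs)
print(f"exponent entropy: {exp_entropy:.2f} bits out of 8")
```

When the exponent field carries well under 8 bits of information, as it does for values clustered in a narrow magnitude range, a lossless coder can store it in fewer bits while copying the sign and mantissa bits through unchanged. The split is exactly invertible, which is what keeps such a scheme lossless.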
Findings
Achieves 613.3 GB/s compression and 2181.8 GB/s decompression throughput on BF16 tensors.
Up to 1.32x speedup in KV cache transfer and 1.30x in overall throughput.
Extends effectively to FP8 KV caches with 1.14x compression over native formats.
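A rough way to see why codec throughput matters for the transfer speedup is a serial cost model: compress, send the smaller payload, then decompress. The sketch below uses the BF16 throughput figures reported above, but the link bandwidth, compression ratio, and KV cache size are hypothetical placeholders, and it assumes that throughputs are measured against uncompressed bytes and that no stage overlaps with another. It does not reproduce the paper's measurements.

```python
def transfer_time_s(num_bytes, link_gb_per_s, ratio=1.0,
                    comp_gb_per_s=None, decomp_gb_per_s=None):
    """Seconds to move num_bytes over a link, with optional compression.

    Serial model (an assumption): compress, then transfer, then decompress.
    Codec throughputs are taken relative to the uncompressed size.
    """
    gigabytes = num_bytes / 1e9
    seconds = (gigabytes / ratio) / link_gb_per_s  # time on the wire
    if comp_gb_per_s is not None:
        seconds += gigabytes / comp_gb_per_s       # compression cost
    if decomp_gb_per_s is not None:
        seconds += gigabytes / decomp_gb_per_s     # decompression cost
    return seconds


# Reported BF16 codec throughputs (from the findings above).
COMP_GB_PER_S = 613.3
DECOMP_GB_PER_S = 2181.8

# Hypothetical placeholders, not values from the paper.
KV_BYTES = 4e9          # size of the KV cache being transferred
LINK_GB_PER_S = 50.0    # interconnect bandwidth
RATIO = 1.3             # lossless compression ratio

raw = transfer_time_s(KV_BYTES, LINK_GB_PER_S)
compressed = transfer_time_s(KV_BYTES, LINK_GB_PER_S, RATIO,
                             COMP_GB_PER_S, DECOMP_GB_PER_S)
speedup = raw / compressed
print(f"modelled transfer speedup: {speedup:.2f}x")
```

In this model the speedup can never exceed the compression ratio, and it approaches the ratio only when the codec runs much faster than the link. That is why a compressor on the transfer path must sustain throughput far above the interconnect bandwidth.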
Abstract
Contemporary systems serving large language models (LLMs) have adopted prefill-decode disaggregation to better balance load between the compute-bound prefill phase and the memory-bound decode phase. Under this design, prefill workers generate a KV cache that must be transferred to decode workers before token generation can begin. Because these workers reside on different physical systems, this transfer becomes a significant bottleneck to serving LLMs at scale, and it is exacerbated for long-input and agentic workloads. Existing lossless codecs are not suited to this setting, as they primarily target offline weight compression, run on the CPU, or use variable-length coding whose decompression is fast but whose compression is too slow to keep up with KV production during prefill. We introduce SplitZip, a GPU-friendly lossless compressor for KV cache transfer that preserves KV tensors…
