ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang

TL;DR
ZipCache introduces a novel KV cache quantization method for large language models that significantly reduces memory and latency while maintaining high accuracy, by accurately identifying salient tokens and employing efficient quantization schemes.
Contribution
The paper proposes ZipCache, a new quantization approach that improves KV cache compression accuracy and efficiency through channel-separable quantization and normalized attention score-based saliency detection.
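As a rough illustration of the saliency detection named above: in a causal attention matrix, earlier tokens appear in more rows, so summing attention scores per column tends to favor them regardless of their importance. The sketch below assumes the normalization divides each column sum by the number of rows that can attend to that token. The function names and the toy matrix are hypothetical and are not taken from the paper's code.

```python
def accumulated_saliency(attn):
    """Column sums of a causal (lower-triangular) attention matrix.

    attn[k][i] is the attention that query token k pays to key token i.
    """
    n = len(attn)
    return [sum(attn[k][i] for k in range(i, n)) for i in range(n)]


def normalized_saliency(attn):
    """Column sums divided by the number of rows that can see each token.

    Assumption: token i is visible to rows i..n-1, i.e. (n - i) rows.
    """
    n = len(attn)
    return [sum(attn[k][i] for k in range(i, n)) / (n - i) for i in range(n)]


def top_salient(scores, count):
    """Indices of the `count` highest-scoring tokens, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:count])


# Toy causal attention matrix (each row sums to 1); values are made up.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.2, 0.6],
]
print(top_salient(accumulated_saliency(attn), 1))
print(top_salient(normalized_saliency(attn), 1))
```

On this toy matrix the plain column sum ranks the first token highest, while the normalized score ranks the last token highest.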
Findings
Achieves 4.98x KV cache compression with only 0.38% accuracy loss on GSM8k.
Reduces prefill latency by 37.3%, decoding latency by 56.9%, and GPU memory by 19.8%.
Outperforms previous KV cache compression methods in speed and accuracy retention.
Abstract
KV cache stores key and value states from previous tokens to avoid re-computation, yet it demands substantial storage space, especially for long sequences. Adaptive KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance. However, previous methods of this approach exhibit significant performance degradation at high compression ratios due to inaccuracies in identifying salient tokens. In this paper, we present ZipCache, an accurate and efficient KV cache quantization method for LLMs. First, we construct a strong baseline for quantizing KV cache. Through the proposed channel-separable tokenwise quantization scheme, the memory overhead of quantization parameters is substantially reduced compared to fine-grained groupwise quantization. To enhance the compression ratio, we propose normalized attention…
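The abstract's point about quantization-parameter overhead can be made concrete with a small sketch. It assumes that channel-separable tokenwise quantization keeps one scale per channel plus one scale and zero point per token, and that the channel scale is the square root of the channel's maximum absolute value. Both are assumptions for illustration, since the text above does not spell out the formula. All function names are hypothetical.

```python
import math


def tokenwise_quantize(rows, bits=4):
    """Asymmetric uniform quantization with one (scale, zero point) per token row."""
    qmax = 2 ** bits - 1
    quantized = []
    for row in rows:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant rows
        codes = [round((x - lo) / scale) for x in row]
        quantized.append((codes, scale, lo))
    return quantized


def tokenwise_dequantize(quantized):
    """Map integer codes back to floats using each row's scale and zero point."""
    return [[c * scale + lo for c in codes] for codes, scale, lo in quantized]


def channel_separable_quantize(rows, bits=4):
    """Divide each channel by its own scale, then quantize per token.

    Assumption: channel scale = sqrt(max |x|) over that channel.
    """
    n_channels = len(rows[0])
    channel_scale = [
        math.sqrt(max(abs(r[j]) for r in rows)) or 1.0 for j in range(n_channels)
    ]
    normalized = [[r[j] / channel_scale[j] for j in range(n_channels)] for r in rows]
    return tokenwise_quantize(normalized, bits), channel_scale


def channel_separable_dequantize(quantized, channel_scale):
    """Undo tokenwise quantization, then multiply the channel scales back in."""
    rows = tokenwise_dequantize(quantized)
    return [[x * channel_scale[j] for j, x in enumerate(r)] for r in rows]


def param_counts(tokens, channels, group_size):
    """Quantization parameters stored under each scheme (illustrative counts).

    Groupwise: one scale and one zero point per group of `group_size` values.
    Channel-separable tokenwise: one scale and one zero point per token,
    plus one scale per channel.
    """
    groupwise = 2 * tokens * channels // group_size
    channel_separable = 2 * tokens + channels
    return groupwise, channel_separable


# Toy key/value slice: 3 tokens x 3 channels, with an outlier middle channel.
rows = [[0.1, 8.0, -0.2], [0.3, -6.0, 0.1], [-0.2, 7.5, 0.05]]
q, scales = channel_separable_quantize(rows, bits=4)
print(channel_separable_dequantize(q, scales))
print(param_counts(tokens=4096, channels=4096, group_size=32))
```

For 4096 tokens, 4096 channels, and a group size of 32, the count under these assumptions is 1,048,576 parameters for groupwise quantization versus 12,288 for the channel-separable tokenwise scheme.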
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
