ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

Yefei He; Luoming Zhang; Weijia Wu; Jing Liu; Hong Zhou; Bohan Zhuang

arXiv:2405.14256 · cs.LG · May 24, 2024 · 1 citation

TL;DR

ZipCache is a KV cache quantization method for large language models that identifies salient tokens accurately and quantizes the cache efficiently, reaching 4.98x compression with a 0.38% accuracy loss on GSM8k while also reducing latency and GPU memory use.

Contribution

The paper proposes ZipCache, a KV cache quantization method that combines a channel-separable tokenwise quantization scheme, which lowers the memory overhead of quantization parameters, with saliency detection based on normalized attention scores.
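
To illustrate why normalizing attention scores can matter for saliency detection, here is a minimal sketch in plain Python. It assumes a causal attention mask and assumes that "normalized" means dividing each token's accumulated attention by the number of queries that are able to attend to it. Neither detail is spelled out on this page, and all function names are hypothetical.

```python
import math

def causal_attention(scores):
    """Row-wise softmax under a causal mask.

    scores[q][k] is the raw score of query q against key k;
    query q may only attend to keys 0..q.
    """
    n = len(scores)
    attn = []
    for q in range(n):
        visible = scores[q][: q + 1]
        m = max(visible)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        attn.append([e / z for e in exps] + [0.0] * (n - q - 1))
    return attn

def accumulated_saliency(attn):
    """Sum of attention received by each key over all queries."""
    n = len(attn)
    return [sum(attn[q][k] for q in range(n)) for k in range(n)]

def normalized_saliency(attn):
    """Assumed normalization: average over the queries that can see the key.

    Under a causal mask, key k is visible to queries k..n-1, i.e. n - k queries.
    """
    n = len(attn)
    return [sum(attn[q][k] for q in range(k, n)) / (n - k) for k in range(n)]

# Uniform raw scores: no token is intrinsically more important than another.
attn = causal_attention([[0.0] * 4 for _ in range(4)])
acc = accumulated_saliency(attn)
norm = normalized_saliency(attn)
print("accumulated:", acc)
print("normalized: ", norm)
```

With uniform scores, the accumulated sum still ranks early tokens far above late ones simply because more queries can see them. The normalized variant narrows that gap, which is the kind of positional bias a normalized saliency metric is meant to reduce.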

Findings

01 Achieves 4.98x KV cache compression with only 0.38% accuracy loss on GSM8k.

02 Reduces prefill latency by 37.3%, decoding latency by 56.9%, and GPU memory by 19.8%.

03 Outperforms previous KV cache compression methods in speed and accuracy retention.
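
The 4.98x figure can be read as an average bit-width per cached value. The short sketch below assumes a 16-bit (FP16) baseline, which this page does not state, and uses a purely illustrative mixed-precision split (60% of tokens at 4 bits, 40% at 2 bits) to show how a ratio near 5x could arise before accounting for quantization-parameter overhead. The function names and the split are hypothetical.

```python
def average_bits(salient_fraction, salient_bits, regular_bits):
    """Average bits per value when salient tokens get a higher bit-width."""
    return salient_fraction * salient_bits + (1.0 - salient_fraction) * regular_bits

def compression_ratio(avg_bits, baseline_bits=16):
    """Compression relative to an assumed 16-bit baseline, ignoring parameter overhead."""
    return baseline_bits / avg_bits

# Illustrative split only: 60% of tokens at 4 bits, 40% at 2 bits.
bits = average_bits(0.6, 4, 2)
print(f"average bits per value: {bits:.2f}")
print(f"ideal compression ratio: {compression_ratio(bits):.2f}x")

# Reading the reported ratio backwards under the same 16-bit assumption.
print(f"bits implied by 4.98x: {16 / 4.98:.2f}")
```

Any real configuration also stores scales and zero points, so the achievable ratio sits slightly below the ideal value computed here.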

Abstract

KV cache stores key and value states from previous tokens to avoid re-computation, yet it demands substantial storage space, especially for long sequences. Adaptive KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance. However, previous methods of this approach exhibit significant performance degradation at high compression ratios due to inaccuracies in identifying salient tokens. In this paper, we present ZipCache, an accurate and efficient KV cache quantization method for LLMs. First, we construct a strong baseline for quantizing KV cache. Through the proposed channel-separable tokenwise quantization scheme, the memory overhead of quantization parameters is substantially reduced compared to fine-grained groupwise quantization. To enhance the compression ratio, we propose normalized attention…
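
The channel-separable tokenwise scheme mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-channel scale (square root of each channel's maximum magnitude), the asymmetric uniform quantizer, and the parameter accounting (one scale and one zero point per group or token, one scale per channel) are all assumptions, and every function name is hypothetical.

```python
import math

def tokenwise_quantize(row, bits):
    """Asymmetric uniform quantization of one token with its own scale and zero point."""
    lo, hi = min(row), max(row)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0         # fall back to 1.0 for constant rows
    codes = [round((v - lo) / scale) for v in row]
    return codes, scale, lo

def tokenwise_dequantize(codes, scale, lo):
    return [q * scale + lo for q in codes]

def channel_separable_quantize(x, bits):
    """x is a list of tokens, each a list of channel values."""
    n_channels = len(x[0])
    # Assumed channel scale: sqrt of the channel's max magnitude (1.0 if all zero).
    c = [math.sqrt(max(abs(tok[j]) for tok in x)) or 1.0 for j in range(n_channels)]
    # Damp outlier channels first, then quantize each token independently.
    packed = [tokenwise_quantize([tok[j] / c[j] for j in range(n_channels)], bits)
              for tok in x]
    return packed, c

def channel_separable_dequantize(packed, c):
    """Undo tokenwise quantization, then restore the per-channel scale."""
    return [[v * c[j] for j, v in enumerate(tokenwise_dequantize(*p))] for p in packed]

def param_counts(tokens, channels, group_size):
    """Quantization parameters stored under the assumed accounting."""
    groupwise = 2 * tokens * channels // group_size   # scale + zero point per group
    channel_separable = 2 * tokens + channels         # per-token pair + per-channel scale
    return groupwise, channel_separable

x = [[0.1, -0.2, 8.0, 0.05], [0.3, 0.1, -6.0, -0.1]]  # channel 2 is an outlier channel
packed, c = channel_separable_quantize(x, bits=8)
print(channel_separable_dequantize(packed, c))
print(param_counts(tokens=4096, channels=4096, group_size=32))
```

Under this accounting, a 4096-token, 4096-channel tensor with group size 32 needs 1,048,576 groupwise parameters versus 12,288 for the channel-separable tokenwise layout, which is the kind of overhead reduction the abstract refers to.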

Code & Models

Repositories

thisisbillhe/zipcache (PyTorch)

Videos

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification · SlidesLive

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization