RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan

TL;DR
RotateKV introduces a novel 2-bit key-value cache quantization method for large language models that maintains high accuracy and robustness while significantly reducing memory usage and increasing decoding speed.
Contribution
The paper proposes RotateKV, a new rotation-based quantization technique with outlier-aware adaptations for robust 2-bit KV cache compression in LLMs.
Findings
Less than 0.3 perplexity degradation on WikiText-2 with 2-bit quantization
Supports 5.75x larger batch sizes due to memory reduction
Achieves 2.32x speedup in decoding stage
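The link between bit-width and batch size can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are hypothetical, and the count ignores quantization scales, zero points, weights, and activations, so the resulting ratio is an idealized upper bound rather than a reproduction of the reported figures.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bits):
    """Bytes for the keys and values of all layers (no scale/zero-point overhead)."""
    # The factor 2 accounts for storing both keys and values.
    elements = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return elements * bits / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(batch=1, seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)

fp16_bytes = kv_cache_bytes(bits=16, **shape)
int2_bytes = kv_cache_bytes(bits=2, **shape)
ratio = fp16_bytes / int2_bytes

print(f"FP16 cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"2-bit cache: {int2_bytes / 2**30:.2f} GiB")
print(f"Ideal compression ratio: {ratio:.1f}x")
```

In practice the achievable batch-size gain is smaller than this ideal ratio, because quantization parameters and the rest of the model also occupy memory.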
Abstract
The Key-Value (KV) cache enables efficient large language model (LLM) inference by avoiding recomputation of past KVs. As the batch size and context length increase, oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization methods rely on fine-grained quantization or on retaining a significant portion of the cache at high bit-widths, both of which compromise the compression ratio and often fail to maintain robustness at extremely low average bit-widths. In this work, we explore the potential of rotation techniques for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency…
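The general idea behind rotation-based quantization can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not RotateKV itself: it applies a plain orthonormal Walsh-Hadamard rotation to a single synthetic key vector with one injected outlier channel, then quantizes to four levels (2 bits) with a per-vector min/max grid. The outlier-aware channel reordering described in the abstract is not implemented, and all names (`fwht`, `quantize_2bit`, `mse`) are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_2bit(vec):
    """Quantize to 4 uniform levels between min and max, then dequantize."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3 or 1.0  # 4 levels -> 3 intervals; guard against zero range
    return [lo + round((x - lo) / step) * step for x in vec]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
key = [random.gauss(0.0, 1.0) for _ in range(128)]
key[5] = 40.0  # synthetic outlier channel (assumption for illustration)

# Baseline: quantize the vector directly; the outlier stretches the grid.
err_direct = mse(key, quantize_2bit(key))

# Rotate, quantize, rotate back. The orthonormal Hadamard matrix is its own inverse.
rotated = fwht(key)
recovered = fwht(quantize_2bit(rotated))
err_rotated = mse(key, recovered)

print(f"MSE without rotation: {err_direct:.3f}")
print(f"MSE with rotation:    {err_rotated:.3f}")
```

The rotation spreads the outlier's energy across all channels, so the four quantization levels cover a much narrower range and the reconstruction error drops. Because the transform is orthonormal, the error measured after rotating back equals the error introduced in the rotated domain.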
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Caching and Content Delivery
Methods: Softmax · Attention Is All You Need
