RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations

Zunhai Su; Zhe Chen; Wang Shen; Hanyu Wei; Linge Li; Huangqi Yu; Kehong Yuan

arXiv:2501.16383·cs.LG·February 4, 2025

TL;DR

RotateKV introduces a novel 2-bit key-value cache quantization method for large language models that maintains high accuracy and robustness while significantly reducing memory usage and increasing decoding speed.

Contribution

The paper proposes RotateKV, a new rotation-based quantization technique with outlier-aware adaptations for robust 2-bit KV cache compression in LLMs.

Findings

01. Less than 0.3 perplexity degradation on WikiText-2 with 2-bit quantization

02. Supports 5.75x larger batch sizes due to memory reduction

03. Achieves a 2.32x speedup in the decoding stage

Abstract

Key-Value (KV) caching enables efficient large language model (LLM) inference by avoiding recomputation of past KVs. As batch size and context length increase, oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization methods rely on fine-grained quantization or on retaining a significant portion of the cache at high bit-widths, both of which compromise the compression ratio and often fail to maintain robustness at extremely low average bit-widths. In this work, we explore the potential of rotation techniques for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing computational efficiency…
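To make the rotate-then-quantize idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it applies a generic orthonormal Walsh-Hadamard rotation to a toy key vector with one outlier channel, then runs a simple asymmetric 2-bit quantizer. The helper names (`fwht`, `quantize_2bit`), the toy data, and the per-vector quantization granularity are all assumptions for illustration, and the channel-reordering step named in the abstract is not modeled.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (illustrative helper).
    len(vec) must be a power of two. With 1/sqrt(n) scaling the transform
    is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize_2bit(vec):
    """Asymmetric uniform quantization to 4 levels, returned dequantized.
    Assumption: one scale and zero point per vector."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3
    if step == 0:
        step = 1.0  # constant vector: any step reproduces it exactly
    return [lo + round((x - lo) / step) * step for x in vec]

# Hypothetical key vector: small values plus a single outlier channel.
key = [0.1 * ((i % 5) - 2) for i in range(16)]
key[3] = 8.0

rotated = fwht(key)                      # outlier energy spread over all channels
restored = fwht(quantize_2bit(rotated))  # quantize, then rotate back

peak_before = max(abs(x) for x in key)
peak_after = max(abs(x) for x in rotated)
print(f"peak magnitude before rotation: {peak_before:.3f}")
print(f"peak magnitude after rotation:  {peak_after:.3f}")
```

The rotation preserves the vector's norm while lowering its peak magnitude, which is the property that rotation-based quantizers rely on. Whether the reconstruction error in `restored` improves over quantizing `key` directly depends on the data and on the quantization granularity, so the sketch makes no claim about it.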

Taxonomy

Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Caching and Content Delivery

Methods: Softmax · Attention Is All You Need