InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

TL;DR
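InnerQ is a hardware-aware KV cache quantization method for large language models. It groups cache matrices along their inner dimension so that dequantization aligns with vector-matrix multiplication, which increases data reuse on the GPU and reduces decoding latency without compromising evaluation performance.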
Contribution
InnerQ introduces a novel group-wise quantization scheme with techniques like hybrid quantization and per-channel normalization to enhance speed and fidelity in KV cache compression.
Findings
Achieves 1.3x speedup over prior methods
Achieves 2.7x speedup over non-quantized baseline
Improves few-shot evaluation scores with quantized KV caches
Abstract
When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods to compress the KV cache while minimizing its loss of precision. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units.…
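To make the inner-dimension grouping concrete, here is a minimal pure-Python sketch. It is not the authors' kernel. It assumes that "inner dimension" means the axis contracted in the vector-matrix product (for example, the head dimension when a query vector multiplies a cached key row), and it uses plain asymmetric uniform quantization with one scale and one offset per group. The function names, the 4-bit width, and the group size are illustrative assumptions, not values taken from the paper.

```python
def quantize_groups(row, group_size, bits=4):
    """Quantize one cache row in contiguous groups along the contracted axis.

    Returns a list of (codes, scale, offset) tuples, one per group.
    Assumption: asymmetric uniform quantization with min/max calibration.
    """
    levels = (1 << bits) - 1
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # Fall back to a dummy scale for a constant group to avoid dividing by zero.
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((x - lo) / scale) for x in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize(groups):
    """Reconstruct the approximate row from its quantized groups."""
    return [lo + scale * c for codes, scale, lo in groups for c in codes]


def dot_quantized(query, groups, group_size):
    """Compute query . dequantize(groups) without materializing the row.

    Because every group lies along the contracted axis, the scale and offset
    are applied once per group to a partial sum, not once per element.
    """
    total = 0.0
    for g, (codes, scale, lo) in enumerate(groups):
        q_chunk = query[g * group_size:(g + 1) * group_size]
        code_sum = sum(q * c for q, c in zip(q_chunk, codes))
        total += scale * code_sum + lo * sum(q_chunk)
    return total


# Hypothetical data: one cached key row and one query vector of length 8.
key_row = [0.1, -0.4, 0.9, 0.3, 2.0, 1.5, -1.0, 0.0]
query = [1.0, 0.5, -0.25, 2.0, 0.0, 1.0, -1.0, 0.5]
groups = quantize_groups(key_row, group_size=4)
score = dot_quantized(query, groups, group_size=4)
```

In this sketch the per-group scale multiplies a partial dot product that has already been accumulated over the group, which is one way to read the abstract's statement that the grouping aligns dequantization with vector-matrix multiplication.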
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Speech Recognition and Synthesis
