InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini; Amir Ardakani; Warren J. Gross

arXiv:2602.23200 · cs.LG · May 22, 2026

TL;DR

InnerQ is a hardware-aware, tuning-free quantization method for the KV cache of large language models. It groups cache matrices along their inner dimension to increase data reuse on the GPU, reducing decode latency without compromising evaluation performance.

Contribution

InnerQ introduces a group-wise quantization scheme that groups cache matrices along their inner dimension and combines it with hybrid quantization and per-channel normalization, improving both the decode speed and the fidelity of the compressed KV cache.

Findings

01. Achieves 1.3x speedup over prior methods

02. Achieves 2.7x speedup over non-quantized baseline

03. Improves few-shot evaluation scores with quantized KV caches

Abstract

When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods to compress the KV cache while minimizing its loss of precision. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units.…
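The inner-dimension grouping described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the asymmetric min-max quantizer, the 4-bit width, the group size, and the helper names `quantize_group`, `quantize_column`, and `dot_quantized` are all assumptions made for illustration. The point it shows is that when groups run along the dimension being summed over in a vector-matrix product, each group's scale and zero point can be applied once to a partial sum instead of once per element.

```python
def quantize_group(values, bits=4):
    """Asymmetric min-max quantization of one group (assumed scheme).

    Returns (codes, scale, zero) such that value ~= scale * code + zero.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize_column(column, group_size, bits=4):
    """Split one matrix column, which runs along the inner dimension, into groups."""
    return [quantize_group(column[i:i + group_size], bits)
            for i in range(0, len(column), group_size)]


def dot_quantized(x, groups, group_size):
    """Dot product of x with a quantized column.

    Because each group lies along the summed dimension, the scale and zero
    point are applied once per group:
        sum_i x_i * (scale * code_i + zero)
            = scale * sum_i(x_i * code_i) + zero * sum_i(x_i)
    """
    total = 0.0
    for g, (codes, scale, zero) in enumerate(groups):
        xs = x[g * group_size:(g + 1) * group_size]
        code_dot = sum(a * c for a, c in zip(xs, codes))
        total += scale * code_dot + zero * sum(xs)
    return total


# Hypothetical data: one input vector and one column of a cache matrix.
x = [0.5, -1.0, 2.0, 0.25, 1.0, 0.0, -0.5, 1.5]
column = [0.1, 0.4, -0.3, 0.8, 1.2, -0.7, 0.05, 0.6]
group_size = 4

groups = quantize_column(column, group_size, bits=4)
approx = dot_quantized(x, groups, group_size)
exact = sum(a * b for a, b in zip(x, column))
```

A full vector-matrix product would repeat `dot_quantized` for every column. In a GPU kernel the per-group scale and zero point would presumably be loaded once and reused across the group's partial sum, which is the kind of data reuse the abstract refers to; the sketch only demonstrates the arithmetic identity.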

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Speech Recognition and Synthesis