NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

Zhihang Cai; Xingjun Zhang; Zhendong Tan; Zheng Wei

arXiv:2505.16210 · cs.LG · May 23, 2025

Open Access

TL;DR

NQKV introduces a normal distribution-based quantization scheme for KV caches in LLMs, significantly reducing memory usage and boosting inference throughput without major accuracy loss.

Contribution

The paper proposes a novel quantization method leveraging normal distribution characteristics for KV caches, enabling lower-bit quantization with minimal accuracy impact.

Findings

01 Enables 2x larger batch sizes during inference.

02 Allows 4x longer context lengths.

03 Achieves 9.3x throughput improvement.
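A rough sizing sketch helps show why the KV cache's element width trades directly against batch size and context length. The function name `kv_cache_bytes`, the model dimensions (32 layers, hidden size 4096), and the 16-bit and 4-bit widths are illustrative assumptions, not figures taken from the paper, and the estimate ignores any per-block metadata a real quantizer would store.

```python
def kv_cache_bytes(n_layers, hidden_size, batch_size, seq_len, bits_per_element):
    """Approximate KV cache size in bytes (hypothetical helper).

    Counts one key vector and one value vector of length hidden_size
    per layer, per sequence, per token. Quantization metadata such as
    per-block scales is deliberately left out.
    """
    elements = 2 * n_layers * batch_size * seq_len * hidden_size
    return elements * bits_per_element // 8


if __name__ == "__main__":
    GIB = 2 ** 30
    # Illustrative dimensions only; not the configuration used in the paper.
    baseline = kv_cache_bytes(32, 4096, batch_size=1, seq_len=2048, bits_per_element=16)
    long_ctx = kv_cache_bytes(32, 4096, batch_size=1, seq_len=8192, bits_per_element=4)
    print(f"16-bit cache, 2048 tokens: {baseline / GIB:.2f} GiB")
    print(f" 4-bit cache, 8192 tokens: {long_ctx / GIB:.2f} GiB")
```

Under these assumptions, a cache stored at a quarter of the bit width fits four times as many tokens in the same memory, which is the kind of trade-off the findings describe. The reported numbers themselves come from the authors' experiments, not from this arithmetic.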

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve…
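The abstract breaks off before describing the algorithm, so the following is only a minimal sketch of generic per-block quantile quantization under a normal-distribution assumption, not the NQKV implementation. The function names, the 4-bit width, the absmax scaling, and the choice of evenly spaced quantiles are all assumptions made for illustration.

```python
from statistics import NormalDist


def normal_codebook(bits):
    """Build 2**bits levels from quantiles of N(0, 1), rescaled to [-1, 1].

    Assumption: probabilities are taken at bin midpoints so the inverse
    CDF stays finite. The paper's actual codebook may differ.
    """
    n = 2 ** bits
    probs = [(i + 0.5) / n for i in range(n)]
    levels = [NormalDist().inv_cdf(p) for p in probs]
    peak = max(abs(v) for v in levels)
    return [v / peak for v in levels]


def quantize_block(block, codebook):
    """Quantize one block: absmax scale plus the index of the nearest level."""
    scale = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - x / scale))
        for x in block
    ]
    return scale, codes


def dequantize_block(scale, codes, codebook):
    """Map stored indices back to approximate values."""
    return [codebook[c] * scale for c in codes]


if __name__ == "__main__":
    codebook = normal_codebook(4)          # 16 levels, denser near zero
    block = [0.1, -0.5, 2.0, -1.2, 0.0, 0.7]
    scale, codes = quantize_block(block, codebook)
    print(codes)
    print(dequantize_block(scale, codes, codebook))
```

Because the levels follow normal quantiles, they cluster near zero, where most values of a normally distributed block lie. That placement is the motivation for choosing quantile levels over uniformly spaced ones.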

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Natural Language Processing Techniques

Methods: OPT