KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

Fei Li; Song Liu; Weiguo Wu; Shiqiang Nie; Jinyu Wang

arXiv:2506.08018 · cs.LG · February 3, 2026

TL;DR

KVmix is a gradient-based mixed-precision quantization method for the KV Cache in LLMs. It allocates bit-widths per layer according to layer importance, reducing memory (4.9x compression reported) while keeping inference accuracy near-lossless.

Contribution

The paper introduces a dynamic, importance-aware mixed-precision quantization technique for the KV Cache that adapts to long-context tasks, improving both memory and computational efficiency.

Findings

1. Achieves 4.9x memory compression on LLMs.
2. Delivers 5.3x inference speedup.
3. Maintains near-lossless inference performance.

Abstract

The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment on resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by the KV Cache. However, existing methods either rely on static, one-size-fits-all precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones, achieving a…
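
The layer-specific bit-width allocation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the importance scores are taken as given inputs (in the paper they come from gradient analysis of the Key and Value projection matrices), and the names `allocate_bits` and `quantize_dequantize`, the 4-bit and 2-bit choices, and the 20% high-precision fraction are all assumptions made for illustration.

```python
def allocate_bits(importance, high_bits=4, low_bits=2, high_fraction=0.2):
    """Assign `high_bits` to the most important layers and `low_bits` to the rest.

    `importance` is a list of per-layer scores (hypothetical inputs here).
    The bit-widths and the fraction are illustrative assumptions.
    """
    n = len(importance)
    if n == 0:
        return []
    n_high = max(1, round(n * high_fraction))
    # Layer indices sorted from most to least important.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    bits = [low_bits] * n
    for i in ranked[:n_high]:
        bits[i] = high_bits
    return bits


def quantize_dequantize(values, bits):
    """Asymmetric uniform quantization of one group of floats, then dequantization."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1          # number of quantization steps
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]


# Hypothetical importance scores for the Key cache of a 5-layer model.
key_importance = [0.9, 0.1, 0.5, 0.3, 0.2]
key_bits = allocate_bits(key_importance)
average_bits = sum(key_bits) / len(key_bits)

# Round-trip one small group of values at the bit-width chosen for layer 0.
restored = quantize_dequantize([0.0, 0.25, 0.7, 1.0], key_bits[0])
```

Because the abstract treats Key and Value projections individually, the same allocation could be run twice, once with Key scores and once with Value scores, so that a layer's Key and Value caches may end up with different bit-widths.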

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy

Methods: LLaMA