KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache
Fei Li, Song Liu, Weiguo Wu, Shiqiang Nie, Jinyu Wang

TL;DR
KVmix is a gradient-based mixed-precision quantization method for KV Cache in LLMs that dynamically allocates precision based on layer importance, significantly reducing memory while maintaining near-lossless accuracy.
Contribution
It introduces a dynamic, importance-aware mixed-precision quantization technique for KV Cache that adapts to long-context tasks, optimizing memory and computational efficiency.
Findings
Achieves 4.9x memory compression on LLMs.
Delivers 5.3x inference speedup.
Maintains near-lossless inference performance.
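As a rough illustration of how a per-layer bit-width mix translates into a compression ratio, the sketch below computes the ratio against a 16-bit baseline. Everything in it is an assumption made for illustration: the function name `kv_compression_ratio`, the FP16 baseline, the group size of 32, the 32 bits of per-group metadata (a scale and a zero point), and the example layer mix. None of these values are taken from the paper.

```python
def kv_compression_ratio(bits_per_layer, baseline_bits=16, group_size=32, meta_bits=32):
    """Ratio of baseline KV Cache size to quantized size.

    bits_per_layer: quantization bit-width chosen for each layer (hypothetical).
    meta_bits: bits of metadata stored per quantization group, e.g. an FP16
               scale plus an FP16 zero point (an assumption, not from the paper).
    """
    # Effective bits per element = payload bits + amortized per-group metadata.
    effective = [b + meta_bits / group_size for b in bits_per_layer]
    avg_bits = sum(effective) / len(effective)
    return baseline_bits / avg_bits


# Hypothetical 32-layer model: 6 layers kept at 4 bits, 26 layers at 2 bits.
hypothetical_mix = [4] * 6 + [2] * 26
print(f"compression vs. 16-bit baseline: {kv_compression_ratio(hypothetical_mix):.2f}x")
```

The metadata term matters at low bit-widths: with small groups, the amortized scale and zero point can be a sizeable fraction of the stored bits, so the achievable ratio is lower than the payload bit-width alone would suggest.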
Abstract
The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment on resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by the KV Cache. However, existing methods either rely on static one-size-fits-all precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones, achieving a…
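The idea of scoring layers by gradient and assigning bit-widths accordingly can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance score (L2 norm of the loss gradient with respect to a layer's projection weights), the two-level 4-bit/2-bit scheme, the `high_fraction` parameter, the group size, and all function names are hypothetical. The abstract excerpt does not specify the exact scoring rule or bit-widths.

```python
import numpy as np


def layer_importance(grads):
    """Score each layer by the L2 norm of its projection-weight gradient.

    grads: list of arrays, one per layer, holding dLoss/dW for that layer's
           Key (or Value) projection matrix. The norm is an assumed proxy.
    """
    return np.array([np.linalg.norm(g) for g in grads])


def allocate_bits(importance, high_bits=4, low_bits=2, high_fraction=0.2):
    """Give the top `high_fraction` of layers `high_bits`; the rest get `low_bits`."""
    importance = np.asarray(importance, dtype=float)
    n_high = int(round(high_fraction * importance.size))
    bits = np.full(importance.size, low_bits, dtype=int)
    if n_high > 0:
        top = np.argsort(importance)[::-1][:n_high]  # indices of most important layers
        bits[top] = high_bits
    return bits


def quantize_dequantize(x, bits, group_size=32):
    """Asymmetric uniform quantization per group, returned in dequantized form.

    Assumes x.size is divisible by group_size.
    """
    x = np.asarray(x, dtype=np.float32)
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    # Guard against constant groups, where hi == lo would give a zero scale.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((groups - lo) / scale), 0, levels)
    return (q * scale + lo).reshape(x.shape)


# Usage with made-up gradients for a 5-layer toy model.
toy_grads = [np.full((4, 4), g) for g in [0.1, 2.0, 0.5, 0.05, 1.0]]
bits = allocate_bits(layer_importance(toy_grads), high_fraction=0.4)
print(bits)  # one bit-width per layer
```

Separate scores could be computed for the Key and Value projections so that each gets its own per-layer bit-width, which matches the abstract's mention of evaluating the two sets of matrices individually.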
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy
Methods: LLaMA
