KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan

TL;DR
KVTuner is a framework that adaptively determines layer-wise mixed-precision quantization settings for the KV cache in LLMs, improving inference efficiency while keeping accuracy nearly lossless across models.
Contribution
KVTuner introduces a sensitivity-aware, layer-wise mixed-precision KV cache quantization method that relies on offline search and pruning techniques, improving flexibility across LLMs and constraints while avoiding the overhead of online decision-making.
Findings
Achieves nearly lossless 3.25-bit quantization for Llama-3.1-8B-Instruct.
Improves inference throughput by up to 21.25%.
Effectively balances accuracy and efficiency across different models.
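A fractional figure such as 3.25 bits arises naturally when layers use different key and value precisions and the bit-widths are averaged. The sketch below shows that arithmetic only. The helper name `average_kv_bits` and the layer split are hypothetical and are not the configuration found by KVTuner. It assumes the key and value caches hold the same number of elements per layer and ignores scale and zero-point metadata.

```python
def average_kv_bits(layer_precisions):
    """Mean bits per cached element for a list of (key_bits, value_bits) pairs.

    Assumes equal-sized key and value caches in every layer and ignores
    the storage cost of quantization scales and zero points.
    """
    total_bits = sum(key_bits + value_bits for key_bits, value_bits in layer_precisions)
    return total_bits / (2 * len(layer_precisions))


# Llama-3.1-8B has 32 transformer layers. The split below is illustrative
# only; it is NOT the precision assignment reported in the paper.
example_config = [(4, 2)] * 24 + [(4, 4)] * 8

print(average_kv_bits(example_config))  # 3.25
```

Many different assignments share the same average, which is why a search over per-layer precision pairs is needed to pick the one that preserves accuracy.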
Abstract
KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large-batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is generally more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline…
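To make the idea of a layer-wise KV precision pair concrete, the following is a minimal sketch of quantizing keys and values at different bit-widths and measuring the resulting error. It uses generic per-token asymmetric round-to-nearest quantization on random stand-in data. The function `fake_quantize`, the tensor shapes, and the precision pairs are assumptions for illustration and do not reproduce the paper's kernels, search procedure, or results.

```python
import numpy as np


def fake_quantize(x, bits):
    """Quantize each row of x to `bits` bits, then dequantize back to floats."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Guard against a zero range so the division below is always defined.
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo


rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 128))    # (tokens, head_dim), random stand-in data
values = rng.standard_normal((16, 128))

# Hypothetical precision pairs (key_bits, value_bits) for two layers.
for key_bits, value_bits in [(8, 4), (4, 2)]:
    key_err = np.abs(fake_quantize(keys, key_bits) - keys).mean()
    value_err = np.abs(fake_quantize(values, value_bits) - values).mean()
    print(f"K{key_bits}V{value_bits}: key error {key_err:.4f}, value error {value_err:.4f}")
```

On random data the error depends only on the bit-width. In a real model it also depends on each layer's attention pattern, which is the sensitivity the abstract refers to.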
Taxonomy
Topics: Advanced Data Storage Technologies · Advanced Data Compression Techniques · Algorithms and Data Compression
Methods: Softmax · Attention Is All You Need · Pruning
