KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li; Zeyu Xing; Yiming Li; Linping Qu; Hui-Ling Zhen; Wulong Liu; Yiwu Yao; Sinno Jialin Pan; Mingxuan Yuan

arXiv:2502.04420 · cs.LG · November 21, 2025

TL;DR

KVTuner is a novel framework that adaptively determines layer-wise mixed-precision quantization for KV caches in LLMs, significantly improving inference efficiency while maintaining near-lossless accuracy across various models.

Contribution

It introduces a sensitivity-aware, layer-wise mixed-precision quantization method with offline search and pruning techniques, enhancing flexibility and reducing overhead.

Findings

01. Achieves nearly lossless 3.25-bit mixed-precision KV cache quantization for Llama-3.1-8B-Instruct.

02. Improves maximum inference throughput by 21.25% compared with KIVI-KV8 quantization.

03. Effectively balances accuracy and efficiency across different models.
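
To make the 3.25-bit figure concrete, the sketch below computes an average KV precision from per-layer (key bits, value bits) pairs. The layer assignment is hypothetical, chosen only so the arithmetic lands on 3.25, and the averaging rule (mean of key and value bit-widths over all layers) is an assumption about how such a figure can arise, not the configuration reported by the authors.

```python
from fractions import Fraction


def average_kv_bits(pairs):
    """Mean bit-width over every key cache and value cache across layers.

    `pairs` is a list of (key_bits, value_bits) tuples, one per layer.
    """
    total_bits = sum(k_bits + v_bits for k_bits, v_bits in pairs)
    return Fraction(total_bits, 2 * len(pairs))


# Hypothetical assignment for a 32-layer model (illustration only):
# a few layers keep high-precision keys, most use 4-bit keys and 2-bit values.
pairs = (
    [(8, 4)] * 2     # 2 layers with 8-bit keys, 4-bit values
    + [(4, 4)] * 6   # 6 layers at 4 bits for both
    + [(4, 2)] * 20  # 20 layers with 4-bit keys, 2-bit values
    + [(2, 2)] * 4   # 4 layers at 2 bits for both
)

avg = average_kv_bits(pairs)
print(float(avg))
```

Note that every pair in this example gives keys at least as many bits as values, which mirrors the abstract's observation that the key cache generally matters more for reducing quantization error.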

Abstract

KV cache quantization can improve the inference throughput and latency of Large Language Models (LLMs) in long-context and large-batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: they overlook layer-wise sensitivity to KV cache quantization, incur high overhead from online fine-grained decision-making, and offer low flexibility across different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is generally more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline…
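
As a rough illustration of the idea of layer-wise precision pairs, the sketch below applies a simulated (quantize-then-dequantize) asymmetric uniform quantizer to toy key and value vectors, using a different bit-width pair per layer. Everything here is an assumption for illustration: the `fake_quantize` helper, the `layer_precision` table, and the random data are hypothetical, and the sketch does not reproduce KVTuner's search procedure or any real quantization kernel.

```python
import random


def fake_quantize(values, bits):
    """Asymmetric uniform quantization followed by dequantization.

    Maps each float onto one of 2**bits evenly spaced levels between the
    minimum and maximum of `values`, then converts back to float.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Hypothetical per-layer (key_bits, value_bits) pairs, fixed ahead of time.
layer_precision = {0: (8, 4), 1: (4, 2)}

# Toy stand-in for a KV cache: one key vector and one value vector per layer.
random.seed(0)
kv_cache = {
    layer: (
        [random.gauss(0.0, 1.0) for _ in range(256)],
        [random.gauss(0.0, 1.0) for _ in range(256)],
    )
    for layer in layer_precision
}

errors = {}
for layer, (k_bits, v_bits) in layer_precision.items():
    keys, values = kv_cache[layer]
    errors[layer] = (
        mean_abs_error(keys, fake_quantize(keys, k_bits)),
        mean_abs_error(values, fake_quantize(values, v_bits)),
    )

for layer, (k_err, v_err) in errors.items():
    print(f"layer {layer}: key error {k_err:.4f}, value error {v_err:.4f}")
```

Because the precision table is a plain lookup decided before inference, applying it adds no per-token decision cost, which is the motivation the abstract gives for avoiding online fine-grained decision-making.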

Code & Models

Repositories

cmd2001/KVTuner (PyTorch, official)

Videos

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference · SlidesLive

Taxonomy

Topics: Advanced Data Storage Technologies · Advanced Data Compression Techniques · Algorithms and Data Compression

Methods: Softmax · Attention Is All You Need · Pruning