PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

Tengxuan Liu; Shiyao Li; Jiayi Yang; Tianchen Zhao; Feng Zhou; Xiaohui Song; Guohao Dai; Shengen Yan; Huazhong Yang; Yu Wang

arXiv:2505.18610·cs.CL·May 27, 2025

TL;DR

This paper introduces PM-KVQ, a quantization method for the KV Cache of long-CoT LLMs that reduces KV Cache memory and cumulative quantization error while maintaining reasoning performance.

Contribution

The paper proposes a progressive mixed-precision quantization scheme and a new calibration strategy that together preserve the reasoning performance of long-CoT LLMs under KV Cache compression.

Findings

01 · Up to 8% performance improvement over SOTA baselines.

02 · Effective reduction of quantization error in long-context scenarios.

03 · Enhanced calibration method for better distribution approximation.

Abstract

Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead because of the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration…
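To make the memory overhead mentioned in the abstract concrete, the following is a minimal back-of-the-envelope sketch of KV Cache size at different bit-widths. The model shape (32 layers, 8 KV heads, head dimension 128) and the 32768-token sequence length are hypothetical values chosen only for illustration; they are not taken from the paper, and the estimate ignores the storage for quantization scales and zero-points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch_size=1):
    """Bytes needed to hold keys and values for all layers.

    Ignores quantization metadata (scales, zero-points), so real
    quantized caches are slightly larger than this estimate.
    """
    # Factor 2 accounts for storing both K and V.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(seq_len=32768, bits=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV Cache at 32768 tokens: {gib:.2f} GiB")
```

Under these assumed dimensions the cache shrinks linearly with bit-width, which is why low-bit quantization is attractive for long reasoning traces.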

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

The progressive mixed-precision quantization is interesting: 1) initially, a high bit-width (16-bit) is used while the sequence is short; 2) the bit-width is then progressively shrunk, while highly sensitive transformer blocks keep a high bit-width to limit quantization error; 3) the memory allocation is block-wise, which adapts well to PagedAttention. Second, the experiments show good results, especially on long-CoT tasks, where the proposed method improves reasoning

Weaknesses

Mixed-precision quantization of the KV cache is already mature in academia; KVTuner, for example, allocates different bit-widths to different layers and to K/V using an optimized search algorithm. The comparison with SOTA mixed-precision quantization methods is therefore insufficient. The practical hardware benefit, such as memory-access savings and throughput gains, is also not reported.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Clear diagnosis of two long-CoT pain points (cumulative error, RoPE low-frequency channels), with concrete formulations (Eqs. 9-12) motivating the positional interpolation trick.
- Simple but effective shrinking rule (Eq. 3) that avoids round-trip dequantization in implementation.
- Block-wise allocation objective is standard, implementable, and explains gains when memory is partially free.
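The progressive idea the reviewers describe (keep high precision while memory is free, then shrink the bit-width as the sequence grows) can be illustrated with a simplified schedule. This is only a sketch under stated assumptions: the candidate widths (16, 8, 4, 2), the function name `progressive_bit_width`, and the per-channel budget are hypothetical, and the code does not reproduce the paper's shrinking rule (Eq. 3) or its block-wise allocation objective.

```python
def progressive_bit_width(seq_len, budget_bits, widths=(16, 8, 4, 2)):
    """Return the widest bit-width whose cache still fits the budget.

    seq_len     : number of cached tokens so far
    budget_bits : hypothetical memory budget, in bits per cache channel
    widths      : assumed candidate bit-widths, widest first
    """
    for bits in widths:
        if seq_len * bits <= budget_bits:
            return bits
    # Budget exceeded even at the narrowest width; fall back to it.
    return widths[-1]


# Assumed budget: room for 4096 tokens at 16 bits per channel.
budget = 4096 * 16

for n in (1000, 4096, 4097, 8193, 16385):
    print(n, "tokens ->", progressive_bit_width(n, budget), "bits")
```

With this assumed budget, the cache stays at 16 bits up to 4096 tokens and halves its bit-width each time the sequence length doubles, so early tokens are never stored at a lower precision than the memory budget requires.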

Weaknesses

- The paper only reports FP16 results, omitting bf16, which is the de facto standard for inference. Since bf16 offers wider dynamic range and distinct hardware behavior, excluding it leaves uncertainty about PM-KVQ’s performance and compatibility in realistic deployment settings.
- Accuracy under fake quant: Reporting accuracy without real 2-bit/4-bit kernels weakens the claim that PM-KVQ is robust in practice.

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. Approaching KV cache quantization from a utility-driven perspective is interesting and practically relevant, as it reflects real-world deployment considerations.
2. The figure illustrating the main idea of the paper is well-designed and easy to follow, effectively conveying the core concept.
3. The authors conduct experiments on multiple models, demonstrating the broad applicability and effectiveness of the proposed method.

Weaknesses

1. My main concern lies in the fairness of the experimental results presented in Table 1. The baseline comparison is rather limited, as KIVI serves as the only comparable baseline in most evaluations. Since KIVI uses a fixed precision while the proposed method can employ higher precision during generation, the comparison may not be entirely fair. It would be helpful to report the memory usage during generation for both KIVI and the proposed method to provide a more complete picture. It remains u

Code & Models

Repositories

thu-nics/pm-kvq
pytorch · Official

Taxonomy

Topics: Advanced Data Storage Technologies · Advancements in Photolithography Techniques · Algorithms and Data Compression