PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs
Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

TL;DR
This paper introduces PM-KVQ, a quantization method for KV caches in long-CoT LLMs that reduces memory use and quantization error while maintaining reasoning performance.
Contribution
The paper proposes progressive mixed-precision quantization and a new calibration strategy to preserve long-CoT LLM reasoning accuracy under KV cache compression.
Findings
Up to 8% performance improvement over SOTA baselines.
Effective reduction of quantization error in long-context scenarios.
Enhanced calibration method for better distribution approximation.
Abstract
Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead because of the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration…
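The abstract's second point ties the calibration problem to RoPE. As a rough illustration of why short calibration sequences can under-sample RoPE, the sketch below counts how many channel pairs do not complete one full rotation within a given context length. It assumes the standard RoPE frequency schedule with base 10000 and a head dimension of 128; neither value is taken from the paper, and the function names are hypothetical.

```python
import math


def rope_wavelengths(head_dim: int, base: float = 10000.0) -> list[float]:
    """Wavelength (in tokens) of each RoPE channel pair.

    Standard RoPE rotates pair i with angular frequency base**(-2*i/head_dim),
    so one full rotation takes 2*pi * base**(2*i/head_dim) tokens.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def channels_without_full_period(
    head_dim: int, context_len: int, base: float = 10000.0
) -> int:
    """Count channel pairs whose wavelength exceeds the context length."""
    return sum(1 for w in rope_wavelengths(head_dim, base) if w > context_len)


# Assumed settings (not from the paper): head_dim=128, base=10000.
for context_len in (2048, 32768):
    n = channels_without_full_period(128, context_len)
    print(f"context {context_len}: {n} of 64 channel pairs see less than one period")
```

Under these assumptions, the lowest-frequency pairs cover only part of their value range in a short calibration sequence, so statistics collected there may not match what the same channels produce late in a long generation.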
Peer Reviews
Decision: ICLR 2026 Poster
The progressive mixed-precision quantization is interesting: 1) initially, a high bit width (16-bit) is used while the sequence is short; 2) the bit width is then progressively shrunk, while highly sensitive transformer blocks keep a higher bit width to limit the quantization error; 3) memory allocation is block-wise, which is compatible with PagedAttention. Second, the experiments show good results, especially on long-CoT tasks, where the proposed method improves reasoning
Mixed-precision KV cache quantization is already mature in academia; for example, KVTuner allocates different bit widths to different layers and to K/V using an optimized search algorithm. The comparison with SOTA mixed-precision quantization methods is therefore insufficient. The practical benefit on hardware, such as memory access savings and throughput increase, is also not reported.
- Clear diagnosis of two long-CoT pain points (cumulative error, RoPE low-frequency channels), with concrete formulations (Eqs. 9-12) motivating the positional interpolation trick.
- Simple but effective shrinking rule (Eq. 3) that avoids round-trip dequantization in implementation.
- Block-wise allocation objective is standard, implementable, and explains gains when memory is partially free.
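The idea of shrinking a bit width without a round trip through floating point can be sketched generically as follows. This is not the paper's Eq. 3, which is not reproduced in this summary; it is one common integer-only scheme, assuming unsigned codes with an asymmetric scale and zero point. The names `shrink_codes` and `dequantize` are hypothetical.

```python
def shrink_codes(codes, scale, zero_point, bits, new_bits):
    """Re-express unsigned `bits`-bit codes at `new_bits` bits.

    The codes are only shifted (with rounding) and clamped; they are never
    converted back to floats. The scale and zero point are rescaled so that
    scale * (code - zero_point) still approximates the same value.
    """
    shift = bits - new_bits
    if shift <= 0:
        raise ValueError("new_bits must be smaller than bits")
    max_code = (1 << new_bits) - 1
    half = 1 << (shift - 1)  # add half a step before shifting to round to nearest
    new_codes = [min((q + half) >> shift, max_code) for q in codes]
    return new_codes, scale * (1 << shift), zero_point / (1 << shift)


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [scale * (q - zero_point) for q in codes]


# Assumed example values: 8-bit codes shrunk to 4 bits.
codes8 = [0, 37, 128, 200, 255]
codes4, scale4, zp4 = shrink_codes(codes8, scale=0.02, zero_point=128, bits=8, new_bits=4)
print(codes4)
print(dequantize(codes8, 0.02, 128))
print(dequantize(codes4, scale4, zp4))
```

In this sketch the extra error introduced by the shrink stays below one step of the new, coarser grid, which is the property that makes in-place shrinking attractive compared with dequantizing and requantizing.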
- The paper only reports FP16 results, omitting bf16, which is the de facto standard for inference. Since bf16 offers wider dynamic range and distinct hardware behavior, excluding it leaves uncertainty about PM-KVQ’s performance and compatibility in realistic deployment settings.
- Accuracy under fake quant: Reporting accuracy without real 2-bit/4-bit kernels weakens the claim that PM-KVQ is robust in practice.
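For readers unfamiliar with the term, "fake quant" means values are quantized and immediately dequantized in floating point, so the rounding error is simulated but nothing is actually stored or computed in low-bit form. A minimal sketch follows, assuming per-group asymmetric min/max quantization; this scheme and the name `fake_quantize` are illustrative assumptions, not the paper's configuration.

```python
def fake_quantize(values, bits):
    """Quantize a group to `bits` bits, then dequantize straight away.

    The output is still a list of floats: accuracy measured on it reflects
    rounding error only, not the behavior of real low-bit kernels.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # a constant group is represented exactly
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


# Assumed example group of cache values.
group = [-1.0, -0.2, 0.1, 0.7, 1.0]
for bits in (4, 2):
    approx = fake_quantize(group, bits)
    worst = max(abs(a - b) for a, b in zip(group, approx))
    print(f"{bits}-bit max abs error: {worst:.4f}")
```

Because the result never leaves floating point, a simulation like this says nothing about memory traffic, packing overhead, or throughput, which is the gap the reviewer points to.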
1. Approaching KV cache quantization from a utility-driven perspective is interesting and practically relevant, as it reflects real-world deployment considerations.
2. The figure illustrating the main idea of the paper is well-designed and easy to follow, effectively conveying the core concept.
3. The authors conduct experiments on multiple models, demonstrating the broad applicability and effectiveness of the proposed method.
1. My main concern lies in the fairness of the experimental results presented in Table 1. The baseline comparison is rather limited, as KIVI serves as the only comparable baseline in most evaluations. Since KIVI uses a fixed precision while the proposed method can employ higher precision during generation, the comparison may not be entirely fair. It would be helpful to report the memory usage during generation for both KIVI and the proposed method to provide a more complete picture. It remains u
Taxonomy
Topics: Advanced Data Storage Technologies · Advancements in Photolithography Techniques · Algorithms and Data Compression
