CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding
Ziteng Sun, Adrian Benton, Samuel Kushnir, Asher Trockman, Vikas Singh, Suhas Diggavi, Ananda Theertha Suresh

TL;DR
This paper introduces CafeQ, a calibration-free quantization method for large language models that optimizes transformations and adaptive rounding without calibration data, improving accuracy with minimal overhead.
Contribution
CafeQ proposes a calibration-free quantization approach that learns weight transformations and adaptive rounding, maintaining high accuracy without access to any calibration data.
Findings
Improves 4-bit quantization score from 61.9 to 62.4 on Gemma 2 9B.
Improves 3-bit quantization score from 52.0 to 60.6 on the same model.
Achieves comparable performance to calibration-dependent methods like GPTQ.
Abstract
Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform…
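The outlier problem described in the abstract can be illustrated with a minimal sketch. It assumes symmetric per-tensor uniform quantization with a max-abs scale, which is a common setup but not necessarily the exact scheme used in the paper. The function names and sample weights are hypothetical.

```python
def quantize_rtn(weights, bits):
    """Round-to-nearest onto a symmetric uniform grid (assumed max-abs scaling)."""
    levels = 2 ** (bits - 1) - 1                 # 7 positive levels for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights]


def squared_error(original, quantized):
    """Sum of squared differences between original and quantized values."""
    return sum((w - q) ** 2 for w, q in zip(original, quantized))


# Hypothetical weights: the same four small values, with and without one outlier.
small = [0.1, -0.2, 0.3, -0.4]
with_outlier = small + [8.0]

err_clean = squared_error(small, quantize_rtn(small, bits=4))
# Error on the four small weights only, after the outlier has stretched the grid.
err_stretched = squared_error(small, quantize_rtn(with_outlier, bits=4)[:4])

print(err_clean, err_stretched)
```

With the outlier present, the grid step grows so much that every small weight rounds to zero, which is the kind of error that transformations and adaptive rounding aim to reduce.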
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Quantization is a very important subject, and the learned linear transformation idea is interesting! The techniques in 4.2 and 5.1 do seem to have some promise.
The standard method for quantizing weight matrices in LLMs is not uniform quantization. It's floating-point quantization. Float8, MXFP4, NVFP4, etc. are all floating-point quantization formats. "The rationale for choosing uniform quantization is because it is highly efficient and broadly supported by most modern hardware accelerators" -- most modern hardware accelerators support floating-point quantization. >Calibration data-free...representative data for calibration is unavailable... even when
S1. Addresses a significant practical limitation of many PTQ methods by eliminating the need for calibration data, enhancing applicability in data-scarce or privacy-sensitive scenarios.
S2. The use of the Frobenius norm of the reconstruction error as a proxy loss is well-motivated and empirically validated (via Spearman correlation), providing a principled foundation for the calibration-free optimization.
S3. The paper presents a cohesive framework addressing both outlier mitigation (via
**W1**. The paired quantization technique is primarily applied to the $W_v, W_o$ pair due to incompatibility with Rotary Positional Embeddings (RoPE) in the $W_q, W_k$ pair. This limits its applicability in many current architectures that still rely on RoPE, though the authors correctly note the trend toward RoPE-free models. Of course, I don't think this issue is significant.
**W2**. The learning of transformation matrices and the adaptive rounding algorithm incur non-trivial offline comput
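The proxy loss mentioned in S2 can be sketched as follows. This is an illustration only: it assumes the proxy is the plain squared Frobenius norm of W minus its quantized version under symmetric round-to-nearest with a max-abs scale, and the paper's actual proxy may differ in detail. All names and the sample matrix are hypothetical.

```python
def quantize_matrix(matrix, bits):
    """Symmetric round-to-nearest with one max-abs scale per matrix (assumed)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in matrix for w in row)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [[round(w / scale) * scale for w in row] for row in matrix]


def frobenius_proxy(matrix, bits):
    """Squared Frobenius norm of the weight reconstruction error W - Q(W)."""
    quantized = quantize_matrix(matrix, bits)
    return sum(
        (w - q) ** 2
        for row, q_row in zip(matrix, quantized)
        for w, q in zip(row, q_row)
    )


# Hypothetical 2x2 weight matrix; no activations or calibration samples are used.
weights = [[0.9, -0.05], [0.3, 0.12]]
proxy_3bit = frobenius_proxy(weights, bits=3)
proxy_8bit = frobenius_proxy(weights, bits=8)
print(proxy_3bit, proxy_8bit)
```

Because the function takes only the weight matrix, it can be evaluated and minimized without any input data, which is what makes a proxy of this kind usable in a calibration-free setting.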
The paper clearly articulates the critical need for calibration-free quantization in real-world scenarios (privacy, data scarcity, domain shift). The improvements over uniform and random rotation baselines on Gemma2 are clear and consistent, especially for the challenging 3-bit case.
1) The work demonstrates limited novelty and insufficient distinction from prior work. The core idea of learning a transformation (M) to improve quantization is the central contribution of SpinQuant (Liu et al., 2024). The distinction here is only the use of a proxy loss instead of calibration data and the relaxation of orthonormality constraints. The use of (W1 M^{-1})(M W2) is explicitly discussed and utilized in related works such as QuaRot (Ashkboos et al., 2024b). As for adaptive round
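The identity behind the (W1 M^{-1})(M W2) construction that the reviewer refers to can be checked numerically. The sketch below uses small hypothetical 2x2 matrices and hand-written helpers; it only demonstrates that inserting an invertible M between two adjacent weight matrices leaves their product unchanged, and it does not reproduce how any of the cited methods choose or learn M.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical adjacent weight matrices and an invertible transformation M.
w1 = [[1.0, 2.0], [3.0, 4.0]]
w2 = [[0.5, -1.0], [2.0, 0.0]]
m = [[2.0, 1.0], [1.0, 1.0]]

w1_transformed = matmul(w1, inverse_2x2(m))   # W1 M^{-1}
w2_transformed = matmul(m, w2)                # M W2

original_product = matmul(w1, w2)
transformed_product = matmul(w1_transformed, w2_transformed)
print(original_product, transformed_product)
```

In a scheme of this kind the two transformed factors would each be quantized separately, so M is free to be chosen to make them easier to quantize while the unquantized product stays the same; the quantization step itself is omitted here.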
+ Originality: Calibration-free PTQ has its merit since the majority of existing methods require some form of calibration data.
+ Quality: The reported experimental results in Tables 3-5 show improved performance over baseline methods including Uniform and Random.
+ Clarity: The majority of the paper is easy to follow.
+ Significance: PTQ for LLMs remains a timely hot topic in AI research.
- My major concern is about technical novelty and experimental verification. The key ideas presented in this work largely exist in the literature, and the authors did not articulate the motivation behind their approach (other than calibration-free).
- The two claims (about the "central questions in scalar PTQ for LLMs") at the end of the intro. section as well as the listed contributions in Sec. 3 often lack substantial justification. For example, the author(s) claimed "three primary contributions
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Mobile Crowdsensing and Crowdsourcing
