BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models
Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du

TL;DR
BASE-Q adds bias correction and asymmetric scaling to rotational quantization, improving the accuracy of quantized large language models while enabling blockwise optimization that reduces training memory overhead.
Contribution
It proposes a bias correction and asymmetric scaling method that addresses two key limitations of existing rotational quantization and enables efficient blockwise training.
Findings
Narrowed accuracy gap to full-precision models by over 50%.
Reduced training memory consumption with blockwise optimization.
Outperformed prior rotational quantization methods on various benchmarks.
Abstract
Rotations have become essential to state-of-the-art quantization pipelines for large language models (LLMs) by effectively smoothing outliers in weights and activations. However, further optimizing the rotation parameters offers only limited performance gains and introduces significant training overhead: because rotation parameters are shared, the full model must be loaded simultaneously to enable backpropagation, resulting in substantial memory consumption and limited practical utility. In this work, we identify two fundamental limitations of current rotational quantization methods: (i) rotation fails to align channel means, resulting in wider quantization bounds and increased rounding errors; and (ii) rotation makes the activation distribution more Gaussian-like, increasing energy loss caused by clipping errors. To address these issues, we introduce BASE-Q, a simple yet powerful…
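Limitation (i) can be illustrated with a small numerical sketch. It is not the authors' implementation (the abstract is truncated before the method is described); it only assumes a single synthetic activation channel with a non-zero mean and a plain symmetric 4-bit quantizer. The helper names `quantize_symmetric` and `mse`, the mean of 3.0, and the sample count are all hypothetical choices made for illustration.

```python
import random


def quantize_symmetric(values, bits=4):
    """Uniform symmetric quantization; the scale is set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point, clamp to the signed range, then dequantize.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Assumed toy data: one activation channel with mean 3.0 and unit-variance noise.
channel = [3.0 + random.gauss(0.0, 1.0) for _ in range(4096)]

# Baseline: quantize the raw values. The mean offset widens the quantization
# bounds, so each rounding step is coarser.
err_plain = mse(channel, quantize_symmetric(channel))

# Mean-corrected: subtract the channel mean, quantize the residual, add it back.
mean = sum(channel) / len(channel)
centered = [v - mean for v in channel]
restored = [q + mean for q in quantize_symmetric(centered)]
err_centered = mse(channel, restored)

print(f"rounding MSE without mean correction: {err_plain:.4f}")
print(f"rounding MSE with mean correction:    {err_centered:.4f}")
```

In this toy setting, removing the mean shrinks the range the quantizer must cover, which is the effect the abstract attributes to aligning channel means. How BASE-Q actually estimates and applies the bias is not specified in the text above.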
Peer Reviews
Decision · Submitted to ICLR 2026
The paper explains the reason behind all of the design choices in the proposed method. The method is effective and shows strong results compared with existing SoTA methods that are used as baselines. The authors include ablations to showcase the importance of each component. In addition to theoretical efficiency arguments, the authors implemented optimized kernels that allow them to obtain real-world speedup more than 2x when using 4-bit quantization.
The writing can be improved quite a lot. Important details are either missing or are scattered in different places. To name a few examples: 1. The dimensions of newly introduced parameters are not clearly specified. It is hard to understand whether a parameter is a scalar, a vector or a matrix. 2. There are two different scalings mentioned in the paper. These are referred to by different names in different places. For example, Table 2 calls them unpaired scale and scale. There is no other mentioned
- The authors conduct extensive empirical results to demonstrate the effectiveness of BASE-Q against a set of leading baselines such as SpinQuant and OSTQuant. The ablation study in Table 2 also looks adequate. - The authors also customize the kernel to further speed up the inference of BASE-Q. Figure 5 looks promising, and the proposed method introduces little overhead.
- The proposed method is largely built upon existing frameworks like SpinQuant. Both bias correction and asymmetric scaling are kind of incremental to the baseline. Moreover, the training paradigm (e.g., blockwise training) also follows SpinQuant. - The writing is kind of hard to follow. Some equations should be explained in more detail (e.g., Equation 7, 8). The necessary derivations are missing. The logic can be a bit messy. For instance, it can be hard to understand the expectation of roundi
1. Compared with the previous approach of optimizing the global rotation matrix, this method can be optimized with fewer GPU resources. 2. The authors conducted extensive experiments on both Qwen series and LLaMA series models under the W4A4KV4 quantization configuration. Across these models, BASE-Q consistently achieved performance improvements.
1. The authors only tested the effectiveness of the proposed method in basic experiments such as Zero-Shot and PPL evaluations. How does this method perform on more complex benchmark datasets like MMLU? 2. While I acknowledge the rationality of the authors' method, it appears that their approach is essentially OmniQuant under rotation conditions. Additionally, the fused-bias technique they employed is not particularly novel: Outlier Suppression++ has also adopted the fused-bias technology. Theref
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Neural Network Applications
