CBQ: Cross-Block Quantization for Large Language Models
Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang

TL;DR
CBQ introduces a novel cross-block quantization method for large language models that effectively handles outliers and dependencies across multiple blocks, significantly improving low-bit quantization accuracy and efficiency.
Contribution
The paper proposes CBQ, a cross-block reconstruction-based PTQ method with a coarse-to-fine preprocessing and adaptive LoRA-Rounding, enabling better low-bit quantization of LLMs.
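The summary above does not spell out how LoRA-Rounding works. As a rough illustration only, the sketch below assumes it resembles AdaRound-style learned rounding in which the per-weight rounding offset comes from a low-rank product of two small matrices rather than a full matrix. The names `low_rank_round`, `a1`, `a2`, and the constants `zeta` and `gamma` are hypothetical choices for this sketch, not taken from the paper.

```python
import math


def rectified_sigmoid(v, zeta=1.1, gamma=-0.1):
    """Map a real value to a rounding offset clipped to [0, 1] (AdaRound-style)."""
    s = 1.0 / (1.0 + math.exp(-v))
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))


def matmul(a, b):
    """Plain-Python matrix product of a (n x r) and b (r x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def low_rank_round(weights, a1, a2, scale, bits=4, hard=True):
    """Quantize `weights` using rounding offsets derived from a1 @ a2.

    Assumption for this sketch: each weight is floored to the integer grid,
    then shifted up by an offset in [0, 1] taken from the low-rank product.
    With hard=True the offset is snapped to 0 or 1 (round down or round up).
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    v = matmul(a1, a2)  # rows x cols offsets from (rows x r) and (r x cols) factors
    out = []
    for w_row, v_row in zip(weights, v):
        q_row = []
        for w, v_ij in zip(w_row, v_row):
            h = rectified_sigmoid(v_ij)
            if hard:
                h = 1.0 if h >= 0.5 else 0.0
            q = math.floor(w / scale) + h
            q = min(qmax, max(qmin, q))  # clamp to the signed integer range
            q_row.append(q * scale)      # map back to the real-valued grid
        out.append(q_row)
    return out


# Rank-1 example: row 0 is pushed to round up, row 1 to round down.
w = [[0.26, -0.31, 0.74], [0.26, -0.31, 0.74]]
a1 = [[1.0], [-1.0]]
a2 = [[5.0, 5.0, 5.0]]
print(low_rank_round(w, a1, a2, scale=0.1))
```

Under this assumption, a rows x cols weight matrix needs only rank x (rows + cols) learnable rounding parameters instead of rows x cols, which is the usual motivation for a low-rank parameterization.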
Findings
CBQ outperforms existing methods on various LLMs and datasets.
CBQ quantizes the LLAMA1-65B model to 4 bits in 4.3 hours on a single GPU.
CBQ achieves a favorable balance between model performance and quantization efficiency.
Abstract
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall…
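One way to picture cross-block reconstruction is as a sliding window over the transformer blocks, where each window is optimized jointly and neighboring windows share blocks. The sketch below only enumerates such windows; the function name `cross_block_windows` and the parameters `window_size` and `overlap` are hypothetical, and the actual schedule used by CBQ may differ.

```python
def cross_block_windows(num_blocks, window_size, overlap):
    """Return lists of block indices, one list per jointly reconstructed window.

    Consecutive windows share `overlap` blocks, so a block near a window
    boundary is revisited together with its successor.
    """
    if num_blocks <= 0 or window_size <= 0:
        raise ValueError("num_blocks and window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    stride = window_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_blocks)
        windows.append(list(range(start, end)))
        if end == num_blocks:  # the last block has been covered
            break
        start += stride
    return windows


# Six blocks, windows of two blocks, one block shared between neighbors.
for window in cross_block_windows(num_blocks=6, window_size=2, overlap=1):
    print(window)
```

Setting `overlap=0` reduces this to independent, non-overlapping groups of blocks; a larger overlap means more blocks are optimized twice, which trades extra calibration time for smoother error propagation across window boundaries.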
Peer Reviews
Decision: ICLR 2025 Spotlight
1. The paper offers an interesting insight: full model quantization introduces inter-layer correlations.
2. Experimental results on LLAMA1, LLAMA2, and OPT demonstrate substantial improvements in W2 and W4 settings, alongside a significant reduction in PTQ scenario.
1. The writing could be improved, especially regarding clarity in notation. While the concepts are interesting, it is difficult to follow due to unclear notation. For example, symbols like \(i\), \(j\), and \(K\) in Equation 3, as well as \(V\) in Equation 11, are not clearly defined when first introduced. It would help if each symbol had an explicit definition upon first use. Additionally, the term "scales" in the phrase "comparisons of the scales between adjacent layers..." lacks clarity. Spec
I have carefully read this work. I think the results are convincing and the approach is solid.
N/A
1. By focusing on inter-block dependencies, CBQ takes a proactive approach to minimize error accumulation, which is often a critical challenge in low-bit quantization. This dependency handling shows clear improvements in model accuracy.
2. The design of CBQ, particularly the coarse-to-fine outlier preprocessing and adaptive rounding, ensures flexibility across different LLM sizes, making it a practical choice for varied deployment needs.
3. The authors have included a wide range of experiments s
1. While the paper demonstrates that overlapping blocks in CBQ contributes to performance, it lacks an in-depth analysis of how varying the overlap size impacts memory efficiency, latency, and overall quantization stability. Providing such details would clarify practical deployment considerations.
2. The coarse-to-fine preprocessing for outliers appears effective; however, an assessment of its necessity relative to simpler methods could be useful. It is unclear whether this specific strategy is
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Focus
