ARB-LLM: Alternating Refined Binarizations for Large Language Models
Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, Xiaokang Yang

TL;DR
ARB-LLM is a binarization technique for large language models that reduces memory and computation demands by progressively updating the binarization parameters and refining the weight partition. Its ARB-LLM-RC variant is reported to surpass FP16 models of the same size.
Contribution
The paper presents ARB-LLM, a novel 1-bit post-training quantization method with alternating refined binarization and column-group strategies, outperforming existing binarization methods for LLMs.
Findings
Significantly reduces quantization error in LLMs.
Outperforms state-of-the-art binarization methods.
Surpasses FP16 models of the same size (ARB-LLM-RC variant).
Abstract
Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization…
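The alternating idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a row-wise approximation W ≈ αB + μ with B in {-1, +1}, and it cycles through least-squares updates of μ, α, and B. The function names (`binarize`, `alternating_refine`), the update order, and the iteration count are hypothetical choices for illustration; calibration data and column-group handling are left out.

```python
import numpy as np


def binarize(W):
    """One-shot row-wise binarization: W is approximated by alpha * B + mu."""
    mu = W.mean(axis=1, keepdims=True)                   # per-row shift
    B = np.where(W - mu >= 0, 1.0, -1.0)                 # sign matrix in {-1, +1}
    alpha = np.abs(W - mu).mean(axis=1, keepdims=True)   # per-row scale
    return alpha, B, mu


def alternating_refine(W, iters=15):
    """Alternately re-fit mu, alpha, and B (illustrative sketch only)."""
    alpha, B, mu = binarize(W)
    errors = [float(np.sum((W - (alpha * B + mu)) ** 2))]
    for _ in range(iters):
        # 1) Shift mu by the mean residual: the least-squares mu for fixed alpha, B.
        residual = W - (alpha * B + mu)
        mu = mu + residual.mean(axis=1, keepdims=True)
        # 2) Least-squares scale for fixed B and mu (uses B * B == 1 elementwise).
        alpha = (B * (W - mu)).mean(axis=1, keepdims=True)
        # 3) Re-assign signs around the refined shift.
        B = np.where(W - mu >= 0, 1.0, -1.0)
        errors.append(float(np.sum((W - (alpha * B + mu)) ** 2)))
    return alpha, B, mu, errors
```

Calling `alternating_refine(W)` on a weight matrix with skewed rows returns the refined parameters together with the squared reconstruction error after each pass, which makes it easy to check how much the refinement gains over the one-shot `binarize` baseline.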
Peer Reviews
Decision: ICLR 2025 Poster
1. The proposed ARB-LLM method is innovative and addresses a critical problem in binary quantization. 2. The experiments on various LLM families demonstrate the superiority of the proposed methods. 3. The paper also provides thorough ablation studies on the effectiveness of the individual components of ARB-LLM.
1. The main weakness of the paper is that the evaluation seems limited to perplexity and average accuracy. The paper would be more interesting if it showed trade-offs on more tasks.
The paper identifies and addresses the distribution shift between floating-point weights and binary weights through an iterative refinement method. The study shows that through the ARB process, the quantization error decreases. The paper further extends the method by adding a calibration dataset and column-wise scales, resulting in better performance compared to BiLLM models. The paper offers a comprehensive ablation study and time/memory analysis, showing the trade-offs of each component.
The authors claim that the weight bit is 1.11 bits, but based on the memory comparison, the model takes approximately 3GB of memory on the Llama-7B model, which is similar in size to GPTQ-3bit of the same model, with inferior performance. Can the authors clarify this? It seems the approach works better on the older OPT model than the Llama family, with Llama3's performance, in particular, being much worse than floating-point. Can the authors showcase the performance on more recent models like P
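The size comparison raised in this review can be made concrete with back-of-the-envelope arithmetic. The sketch below counts quantized weights only and assumes a nominal parameter count of 7e9 for a "7B" model; both the helper name `weight_memory_gb` and that count are illustrative assumptions, not figures from the paper.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weights-only storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal size of a "7B" model, not an exact count

for label, bits in [("1.11-bit", 1.11), ("3-bit", 3.0), ("FP16", 16.0)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

This estimate covers only the quantized weight matrices; anything else held in memory at run time falls outside it, so it is a lower bound to compare against a measured footprint.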
The authors propose several improvements to the BiLLM method and show that ARB-LLM-RC outperforms FP16 models of the same size (in GB) as well as BiLLM, the state of the art in LLM post-training binarization. An extensive ablation study shows the effectiveness of ARB, ARB-LLM-X, ARB-LLM-RC, and their combination with CGB, along with the effect of calibration data size, iteration number, and group number. Extensive experiments were conducted on the LLaMA, LLaMA-2, and LLaMA-3 families. They did a thorough an
Q: In Figure 1, the authors show a Pareto curve and demonstrate that ARB-LLM-RC outperforms same-size FP16 models and previously published binarized models. This shows an improvement over previous binarization methods, but it does not give the whole picture in terms of the Pareto curve: Figure 1 only compares binarized models against FP16. It would be informative to also show 4-bit and 8-bit models. For example, the paper "The case for 4-bit precision: k-bit Inference Scaling Laws" shows that
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
