TL;DR
This paper presents a novel progressive training method for 1-bit quantization of large language models, effectively leveraging pre-trained models to reduce costs and improve accuracy.
Contribution
It introduces a consistent progressive training approach with binary-aware initialization and dual-scaling, enabling high-performance 1-bit LLMs from pre-trained models.
Findings
Outperforms existing 1-bit LLM quantization methods.
Reduces training costs by leveraging pre-trained models.
Achieves high accuracy with 1-bit quantized LLMs.
Abstract
1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full-precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a consistent progressive training scheme for both the forward and backward passes, smoothly converting the full-precision weights into binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved…
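The idea of smoothly converting full-precision weights into binarized ones can be illustrated with a minimal sketch. This is not the paper's algorithm. The linear blend, the linear schedule, the row-wise mean-absolute-value scale, and all function names (`binarize`, `progressive_weights`, `linear_schedule`) are assumptions chosen for illustration, and the paper's binary-aware initialization and dual-scaling compensation are not modeled here.

```python
import numpy as np

def binarize(w):
    """Row-wise 1-bit approximation: sign(w) scaled by the row's mean |w|.

    The per-row mean-absolute-value scale is an assumed, commonly used
    choice; the abstract does not specify the paper's scaling.
    """
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    signs = np.where(w >= 0, 1.0, -1.0)  # map 0 to +1 so every entry is +1 or -1
    return alpha * signs

def progressive_weights(w, lam):
    """Blend full-precision and binarized weights (assumed linear blend).

    lam = 0 returns the full-precision weights, lam = 1 the binarized ones.
    """
    return (1.0 - lam) * w + lam * binarize(w)

def linear_schedule(step, total_steps):
    """Hypothetical schedule: lam grows linearly from 0 to 1, then stays at 1."""
    return min(1.0, step / total_steps)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))  # stand-in for one pre-trained weight matrix

for step in (0, 500, 1000):
    lam = linear_schedule(step, 1000)
    w_eff = progressive_weights(w, lam)
    gap = np.abs(w_eff - binarize(w)).max()
    print(f"step={step} lam={lam:.1f} max gap to binary={gap:.4f}")
```

In this sketch the gap to the binarized weights shrinks in proportion to (1 - lam), so the weights used in the forward pass move gradually from the pre-trained values to a 1-bit representation instead of jumping there in one step.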
Peer Reviews
Decision · Submitted to ICLR 2026
The paper presents a thoughtful and well-motivated attempt to make 1-bit quantization more practical for large language models. The idea of leveraging pre-trained weights rather than training from scratch is both efficient and timely, and the proposed progressive training and dual-scaling strategies are conceptually sound. The paper is clearly written with comprehensive experiments.
Although the authors include additional results on LLaMA2-7B in the appendix, larger-scale validation (e.g., 13B or 30B models) is still missing, leaving some uncertainty about scalability under truly large-model settings. The comparison baselines are reasonably strong, but despite explicitly discussing the instability of 1-bit quantization on newer models such as Qwen3, the authors do not include direct experiments on it. This omission leaves open how well BinaryLLM performs on the latest LLM a
1. For the 130M model, BinaryLLM requires only 20B training tokens, about 63 times fewer than the 1.26T tokens needed to train a model from scratch.
2. The insight that well-trained models are harder to quantize, yet still outperform under-trained ones after binarization, is interesting, and this point is thoroughly discussed.
3. The motivation is clear. The method is generalizable, and the performance is satisfactory.
1. For the ablation study on progressive training (Table 4), the comparison should use BinaryLLM without binary-aware initialization and dual-scaling. Merely comparing results between BinaryLLM and IR-Net fails to highlight the effectiveness of progressive training.
2. The binary-aware initialization yields only marginal improvements, as shown in Table 9.
3. There are some typos. For instance, line 372 should read "from smallest to largest".
The paper is well written.
1. PTB perplexity evaluation is confusing. For example, on the 3B model, BitNet b1.58 clearly outperforms the proposed BinaryLLM on C4 and WikiText2 perplexity, but performs significantly worse on PTB perplexity. This inconsistency is also observed in other model evaluations, including the 1.3B experiments. It is recommended to recheck the PTB perplexity evaluation.
2. In Section 4.2, the training data and base LLM are not aligned. In the 130M experiment, BinaryLLM uses SmolLM data, while FBI-L
Taxonomy
Topics: Advanced Neural Network Applications · Speech Recognition and Synthesis · Natural Language Processing Techniques
