Unlocking the Theory Behind Scaling 1-Bit Neural Networks
Majid Daliri, Zhao Song, Chiwun Yang

TL;DR
This paper provides the first theoretical proof of a scaling law for 1-bit neural networks, showing that their training dynamics align with kernel behavior as the network width increases, leading to improved performance.
Contribution
It establishes a rigorous theoretical foundation for the scaling behavior of 1-bit neural networks and introduces the concept of generalization difference.
Findings
Training dynamics align with kernel behavior as width increases
Loss converges to arbitrarily small values with increasing width
Generalization difference remains negligible as networks scale
Abstract
Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023) and Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a Scaling Law for 1-bit Neural Networks. In this paper, we present the first theoretical result that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the…
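To make the "kernel behavior as width grows" claim concrete, here is a minimal NumPy sketch. It is not the paper's construction: it assumes a two-layer ReLU network whose hidden weights are sign-quantized to ±1, with gradients passed to the latent weights through a straight-through estimator, and the helper names (`binarize`, `gram_matrix`, `gap_between_draws`) are hypothetical. It compares the NTK-style Gram matrix across two independent initializations at a narrow and a wide width; under these assumptions the gap should shrink as width grows, which is the sense in which the kernel becomes stable.

```python
import numpy as np

def binarize(w):
    """Map latent real-valued weights to {-1, +1} (zeros go to +1)."""
    return np.where(w >= 0, 1.0, -1.0)

def gram_matrix(x, w_latent):
    """Empirical NTK-style Gram matrix for 1-bit hidden weights (illustrative assumption).

    x: (n, d) inputs, w_latent: (d, m) latent weights.
    H[i, j] = (x_i . x_j) * mean over r of 1[w_r . x_i > 0 and w_r . x_j > 0],
    where w_r is the r-th binarized hidden weight vector.
    """
    active = (x @ binarize(w_latent) > 0).astype(float)  # (n, m) ReLU activation pattern
    m = w_latent.shape[1]
    return (x @ x.T) * (active @ active.T) / m

def gap_between_draws(x, m, rng):
    """Frobenius distance between Gram matrices from two independent initializations."""
    d = x.shape[1]
    h1 = gram_matrix(x, rng.standard_normal((d, m)))
    h2 = gram_matrix(x, rng.standard_normal((d, m)))
    return np.linalg.norm(h1 - h2)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit-norm inputs

gap_narrow = gap_between_draws(x, 64, rng)
gap_wide = gap_between_draws(x, 16384, rng)
print(f"gap at width 64: {gap_narrow:.4f}, at width 16384: {gap_wide:.4f}")
```

The sketch only inspects the kernel at initialization; it does not train the network or reproduce any result from the paper.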
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The topic this paper tries to push is timely, as there have been several works in Quantization Aware Training of low-bitwidth models recently, which show how these models start working better at scale. 2. This paper takes an interesting NTK perspective to justify training dynamics and generalization of binary neural networks. They also derive a scaling law of 1-bit neural networks under certain assumptions.
1. The main weakness of the paper is the lack of realistic experiments and the positioning of the paper. The authors use 1-bit LLMs to motivate the paper in the abstract and introduction, but their theory and experiments are for quantized MLPs at a very small scale. The authors should rewrite the abstract and introduction to clarify what their focus is on rather than overemphasizing low-bitwidth LLMs. For example, authors should explicitly state their focus is on quantized MLPs for learning func
- The paper studies an important topic of training quantized networks, from a theoretical point of view. - Showing convergence of 1-bit network with quantization aware training is, to my knowledge, novel. - Studying the relation between 1-bit network and full-precision networks is interesting and important.
- Lemma 4.1: the results in the lemma depend on $\lambda$, which I assume is $\lambda_{\min}(H^*)$ (this should be properly stated). However, if this is the case $\lambda$ should depend explicitly on $\kappa^2$, and the dependence on $\kappa^2$ should be tracked throughout the results. It is better to define $\lambda$ as the minimal eigenvalue of the non-scaled Gram matrix, and have the explicit dependence on $\kappa^2$ in the results. - It is worth noting that the scaling laws derived in propos
- This paper studies the important problem of scaling 1-bit neural networks. Given the increasing deployment of large foundation models, reducing their energy consumption has become an increasingly important problem. 1-bit quantization is a promising approach to study. - The authors conduct thorough experiments to validate the theoretical analysis.
- This paper studies the scaling law of 1-bit neural networks. However, the analysis is performed on simple toy models. The original scaling law as proposed by Kaplan [1] is conducted on Transformers with millions/billions of parameters with huge amounts of pretraining compute. The goal of scaling law should be to study how to reliably predict the performance of pretraining with larger compute. However, the setting of this paper does not fit in the category of “scaling law”, given the small sca
1) The paper introduces a variation of the soft-committee machine that can be analytically traced to show the convergence of 1-bit models. 2) The convergence is analysed for the number of model parameters and the size of the training data set.
1) Given that the paper provides numerical results, I found it problematic that the main result of the paper (the functional form of the convergence of the training loss of 1-bit models) is not verified with numerical results. 2) The finding of a "scaling law" in Fig.1 with four data points, no comparison with a theory curve, and significant fluctuations is, in my opinion, not a valid verification of Proposition 4.3 at all. 3) The figures are not properly described/referenced in the text.
Taxonomy
Topics: Neural Networks and Applications
Methods: ALIGN
