Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding
Taowen Liu, Marta Andronic, Deniz Gündüz, George A. Constantinides

TL;DR
This paper investigates how stochastic rounding enables efficient low-bit training of large language models by analyzing its interaction with batch size and quantization effects, supported by theoretical and empirical results.
Contribution
It provides a theoretical and empirical analysis of stochastic rounding in low-bit LLM training, highlighting how batch size and quantization influence convergence and accuracy.
Findings
Increased batch size compensates for reduced precision during training.
Quantizing weights and activations affects gradient variance differently.
Experimental results validate the theoretical insights.
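As a rough illustration of the first finding, the toy sketch below averages a batch of stochastically rounded copies of one fixed value and measures how the variance of that average shrinks as the batch grows. This is not the paper's analysis or experimental setup: the gradient value `0.37`, the grid spacing `0.25`, the helper names, and the assumption that every sample in the batch shares the same true value are all hypothetical choices made for the example.

```python
import math
import random
import statistics

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`; round up with probability equal
    to the fractional distance from the lower grid point."""
    scaled = x / step
    lower = math.floor(scaled)
    return (lower + (rng.random() < scaled - lower)) * step

def batch_mean_variance(batch_size, step, rng, trials=2000):
    """Empirical variance of the mean of `batch_size` independently
    stochastically rounded copies of a fixed value (toy model)."""
    true_grad = 0.37  # hypothetical per-sample gradient value
    means = []
    for _ in range(trials):
        total = sum(stochastic_round(true_grad, step, rng)
                    for _ in range(batch_size))
        means.append(total / batch_size)
    return statistics.pvariance(means)

rng = random.Random(0)  # fixed seed so the run is repeatable
var_small = batch_mean_variance(1, 0.25, rng)
var_large = batch_mean_variance(16, 0.25, rng)
print(f"batch size 1:  variance of mean = {var_small:.6f}")
print(f"batch size 16: variance of mean = {var_large:.6f}")
```

Because the rounding errors are independent and zero-mean, the variance of the batch average falls roughly in proportion to the batch size in this toy model, which is the intuition behind trading a coarser grid for a larger batch.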
Abstract
LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors, especially batch size, remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during back-propagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.
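To make the unbiasedness property concrete, here is a minimal sketch of stochastic rounding next to round-to-nearest on a uniform grid. It is an illustration under stated assumptions, not the paper's implementation: the function names, the grid spacing `step`, and the test value `0.3` are hypothetical.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability
    equal to the fractional distance from the lower grid point."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower  # lies in [0, 1)
    return (lower + (1 if rng.random() < frac else 0)) * step

def nearest_round(x, step):
    """Deterministic round-to-nearest on the same grid."""
    return math.floor(x / step + 0.5) * step

rng = random.Random(0)  # fixed seed so the run is repeatable
x, step, n = 0.3, 1.0, 100_000

# Average many stochastic roundings of the same value.
sr_mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
rn = nearest_round(x, step)

print(f"stochastic rounding, mean of {n} draws: {sr_mean:.4f}")
print(f"round-to-nearest: {rn}")
```

Each stochastic draw is either 0 or 1 here, yet the average approaches 0.3, so the rounding error has zero mean. Round-to-nearest always returns 0 for this input, which is a systematic error that averaging cannot remove.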
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
