Why Does Stochastic Gradient Descent Slow Down in Low-Precision Training?
Vincent-Daniel Yun

TL;DR
This paper analyzes how low-precision gradient quantization causes gradient shrinkage, which slows convergence and raises the steady-state error of stochastic gradient descent, and gives theoretical insight into the effects of reduced numerical precision.
Contribution
It introduces a gradient shrinkage model for low-precision training and provides theoretical analysis of its impact on SGD convergence and steady-state error.
Findings
Low-precision training causes gradient magnitude shrinkage.
Convergence slows proportionally to the minimum shrinkage factor.
Steady-state error increases with lower numerical precision.
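The slowdown in the second finding can be illustrated with a minimal sketch. It assumes a one-dimensional quadratic objective f(x) = x^2 / 2, a constant stepsize, a constant shrinkage factor q in (0, 1], and no gradient noise. None of these choices come from the paper, and the function name `steps_to_tolerance` is hypothetical.

```python
def steps_to_tolerance(q, mu=0.1, x0=1.0, tol=1e-3, max_steps=10_000):
    """Count SGD steps until |x| < tol on f(x) = 0.5 * x**2.

    Assumption (illustrative only): the gradient x is scaled by a constant
    shrinkage factor q in (0, 1], so the update uses the effective
    stepsize mu * q instead of mu.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    x = x0
    for step in range(1, max_steps + 1):
        grad = x                 # exact gradient of 0.5 * x**2
        x = x - mu * q * grad    # shrunk update: effective stepsize mu * q
        if abs(x) < tol:
            return step
    return max_steps


full_precision = steps_to_tolerance(q=1.0)   # no shrinkage
low_precision = steps_to_tolerance(q=0.5)    # gradients shrunk by half
print(full_precision, low_precision)
```

In this noise-free setting each step contracts the iterate by a factor of 1 - mu * q, so a smaller q needs more steps to reach the same tolerance. The sketch does not model the additional quantization error that the paper associates with the higher steady-state error.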
Abstract
Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this study, we explore SGD convergence under a gradient shrinkage model, where each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence when the minimum shrinkage factor satisfies \( q_{\min} < 1 \). With typical smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower pace set by \( q_{\min} \), and with a higher steady-state error level due to quantization effects. We analyze theoretically how lower numerical precision slows training by treating it as gradient shrinkage within the standard…
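As a sketch of why shrinkage acts on the stepsize, the update can be written out explicitly. The symbols \( x_k \) for the iterate and \( g_k \) for the unquantized stochastic gradient are assumptions introduced here for illustration; only \( \mu_k \) and \( q_k \) appear in the abstract.

\[
x_{k+1} = x_k - \mu_k \,(q_k g_k) = x_k - (\mu_k q_k)\, g_k
\]

Scaling the gradient by \( q_k \) is therefore algebraically the same as running SGD on \( g_k \) with the smaller stepsize \( \mu_k q_k \le \mu_k \).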
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis
