Neural gradients are near-lognormal: improved quantized and sparse training
Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, Daniel Soudry

TL;DR
This paper reveals that neural gradients follow a near-lognormal distribution and introduces two analytical methods to quantize and sparsify gradients, significantly reducing computational costs while maintaining accuracy.
Contribution
To the authors' knowledge, the paper is the first to quantize gradients to 6-bit floating-point and to reach up to 85% gradient sparsity without accuracy loss, in both cases by exploiting the near-lognormal distribution of neural gradients.
Findings
Gradient quantization to 6-bit floating-point achieves state-of-the-art results on ImageNet.
Gradient sparsity of up to 85% is achieved without accuracy degradation.
Both methods are closed-form and reduce the computational and memory burdens of neural gradients.
Abstract
While training can mostly be accelerated by reducing the time needed to propagate neural gradients back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. Distinguished from weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point…
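To make the sparsity-threshold idea concrete, here is a minimal sketch under stated assumptions. It assumes gradient magnitudes are exactly lognormal, i.e. ln|g| ~ N(mu, sigma^2), and uses plain deterministic magnitude pruning, which may differ from the pruning scheme used in the paper. Under that assumption the threshold for a target sparsity S has the closed form alpha = exp(mu + sigma * Phi^-1(S)), where Phi^-1 is the inverse standard normal CDF. The names lognormal_threshold and prune, and the synthetic parameters mu = -8 and sigma = 2, are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def lognormal_threshold(magnitudes, target_sparsity):
    """Threshold alpha with P(|g| < alpha) ~= target_sparsity,
    assuming ln|g| is normally distributed (illustrative assumption)."""
    logs = [math.log(m) for m in magnitudes if m > 0.0]
    mu = fmean(logs)        # estimated mean of ln|g|
    sigma = pstdev(logs)    # estimated standard deviation of ln|g|
    z = NormalDist().inv_cdf(target_sparsity)  # standard normal quantile
    return math.exp(mu + sigma * z)


def prune(grads, alpha):
    """Deterministic magnitude pruning: zero every entry with |g| < alpha."""
    return [0.0 if abs(g) < alpha else g for g in grads]


# Synthetic "gradients": random sign times a lognormal magnitude.
rng = random.Random(0)
grads = [rng.choice((-1.0, 1.0)) * rng.lognormvariate(-8.0, 2.0)
         for _ in range(100_000)]

alpha = lognormal_threshold([abs(g) for g in grads], target_sparsity=0.85)
pruned = prune(grads, alpha)
achieved = sum(1 for g in pruned if g == 0.0) / len(pruned)
```

On this synthetic data, achieved should land close to the 0.85 target, because the data match the lognormal assumption by construction. On real gradients the fit is only approximate, so the achieved sparsity would deviate accordingly.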
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
