Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Noam Shazeer, Mitchell Stern

TL;DR
This paper introduces Adafactor, an optimizer that reduces memory usage by maintaining only row and column sums of second-moment estimates, while achieving comparable performance to Adam in training neural networks.
Contribution
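The paper proposes Adafactor, a memory-efficient adaptive optimizer that factors second-moment estimates into per-row and per-column sums, adds update clipping and a gradually increasing decay rate, and drops momentum, enabling large-scale training with minimal auxiliary storage.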
Findings
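Adafactor produces results comparable to Adam when training Transformer models.
For weight matrices, second-moment storage drops from one value per parameter to one value per row plus one per column.
Update clipping and a gradually increasing decay rate counter the larger-than-desired updates that arise when the second-moment accumulator decays too slowly.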
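The memory saving comes from storing only row and column sums of the second-moment matrix. The following is a minimal sketch in plain Python, assuming the per-parameter estimate is rebuilt as the product of the matching row sum and column sum divided by the grand total. The function names and the example matrix are hypothetical, chosen only for illustration.

```python
def factored_estimate(v):
    """Approximate a non-negative matrix v from its row and column sums."""
    row_sums = [sum(row) for row in v]          # n stored values
    col_sums = [sum(col) for col in zip(*v)]    # m stored values
    total = sum(row_sums)                       # same as sum(col_sums)
    # Rank-1 reconstruction: v_hat[i][j] = row_sums[i] * col_sums[j] / total
    return [[r * c / total for c in col_sums] for r in row_sums]


def auxiliary_floats(n, m):
    """Stored values for full vs. factored second-moment state of an n x m matrix."""
    return n * m, n + m


# Hypothetical squared-gradient averages for a 2 x 3 weight matrix (rank 1).
v = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0]]
v_hat = factored_estimate(v)

# Storage comparison for an assumed 4096 x 4096 weight matrix.
full, factored = auxiliary_floats(4096, 4096)
```

Because the example matrix is rank 1, the reconstruction recovers it exactly; for a general matrix the result is only an approximation whose row and column sums still agree with the originals.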
Abstract
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve…
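The two stability remedies named in the abstract can be sketched as follows. This assumes clipping rescales an update whenever its root-mean-square exceeds a threshold, and that the decay rate follows the form 1 - t^(-c). The function names, the threshold of 1.0, and the exponent of 0.8 are assumptions for illustration and are not stated in the text above.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def clip_update(update, threshold=1.0):
    """Scale the update down so that its RMS does not exceed `threshold`."""
    scale = max(1.0, rms(update) / threshold)
    return [u / scale for u in update]


def decay_rate(step, exponent=0.8):
    """Second-moment decay rate that rises toward 1 as `step` (>= 1) grows."""
    return 1.0 - step ** (-exponent)


big = clip_update([3.0, -4.0, 0.0, 0.0])   # RMS 2.5, rescaled to RMS 1.0
small = clip_update([0.1, -0.2])           # RMS below threshold, left unchanged
```

With a schedule of this shape, early steps weight recent gradients heavily, while later steps average over a longer history.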
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Model Reduction and Neural Networks
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · RMSProp · Residual Connection · Adafactor · Byte Pair Encoding · Dense Connections · Label Smoothing
