Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer; Mitchell Stern

arXiv:1804.04235 · cs.LG · April 13, 2018 · 163 cites

TL;DR

This paper introduces Adafactor, an optimizer that reduces memory usage by maintaining only row and column sums of second-moment estimates, while achieving comparable performance to Adam in training neural networks.

Contribution

The paper proposes a memory-efficient adaptive optimizer, Adafactor, with novel update schemes and scaling methods, enabling large-scale training with minimal auxiliary storage.

Findings

01

Adafactor matches Adam's performance in training Transformer models.

02

For a weight matrix, the method reduces the memory needed for second-moment estimation from one value per parameter to one value per row plus one per column.

03

Update clipping and decay schemes improve training stability.
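As a rough illustration of the update clipping mentioned above, the sketch below rescales an update whenever its root-mean-square exceeds a threshold. The helper names `rms` and `clip_update`, the threshold value of 1.0, and the exact rescaling rule are assumptions made for this example; they are not taken from this page.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def clip_update(update, threshold=1.0):
    """Scale the update down so that its RMS does not exceed `threshold`.

    Assumed rule for this sketch: divide by max(1, RMS(update) / threshold).
    Updates whose RMS is already below the threshold are left unchanged.
    """
    scale = max(1.0, rms(update) / threshold)
    return [u / scale for u in update]


# A large update is shrunk, a small one passes through untouched.
large = clip_update([3.0, 4.0])
small = clip_update([0.1, -0.2])
```

Clipping on the RMS of the whole update, rather than on each element, keeps the direction of the update intact and only limits its overall size.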

Abstract

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve…
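The factored estimator described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `beta2` default, and the reconstruction rule (outer product of row and column sums divided by the total sum) are choices made for this example. That rule is exact whenever the matrix of squared gradients has rank 1.

```python
def update_factored_sums(row_sums, col_sums, grad, beta2=0.999):
    """One moving-average step on per-row and per-column sums of squared gradients.

    `grad` is a list of rows (an n x m matrix). Only n + m numbers are stored,
    instead of the n * m needed for a full second-moment accumulator.
    """
    n, m = len(grad), len(grad[0])
    new_rows = [
        beta2 * row_sums[i] + (1.0 - beta2) * sum(grad[i][j] ** 2 for j in range(m))
        for i in range(n)
    ]
    new_cols = [
        beta2 * col_sums[j] + (1.0 - beta2) * sum(grad[i][j] ** 2 for i in range(n))
        for j in range(m)
    ]
    return new_rows, new_cols


def estimate_second_moments(row_sums, col_sums):
    """Rebuild an n x m estimate from the sums (assumed rule: R_i * C_j / sum(R))."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# Hypothetical 2 x 2 gradient whose squared entries form a rank-1 matrix.
grad = [[1.0, 2.0], [2.0, 4.0]]
rows, cols = update_factored_sums([0.0, 0.0], [0.0, 0.0], grad, beta2=0.0)
v_hat = estimate_second_moments(rows, cols)
```

With `beta2=0.0` the sums equal the current squared-gradient sums, so `v_hat` reproduces the squared gradients of this example exactly. For matrices that are not rank 1 the reconstruction is only an approximation.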

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Model Reduction and Neural Networks

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · RMSProp · Residual Connection · Adafactor · Byte Pair Encoding · Dense Connections · Label Smoothing