One Epoch Is All You Need
Aran Komatsuzaki

TL;DR
Training large Transformer language models for a single epoch, with the model size and number of iterations adjusted accordingly, can dramatically improve efficiency, reduce costs, and potentially enable faster development of state-of-the-art models.
Contribution
This paper proposes a novel approach of training on larger datasets for only one epoch, with heuristic adjustments to model size and iterations, leading to significant speedups and cost reductions.
Findings
One-epoch training yields no overfitting and faster convergence.
Adjusting model size and iterations improves training speed by up to 2.7x.
Combined methods achieve up to 5.1x speedup in training efficiency.
Abstract
In unsupervised learning, collecting more data is not always costly, unlike training. For example, given how many webpages there are on the Internet, it is not hard to enlarge the 40GB WebText dataset used for training GPT-2 by modifying its sampling methodology. On the other hand, given that training on this dataset already costs tens of thousands of dollars, naively training on a larger dataset is not financially feasible. In this paper, we suggest training on a larger dataset for only one epoch, unlike the current practice in which unsupervised models are trained for tens to hundreds of epochs. Furthermore, we suggest adjusting the model size and the number of iterations appropriately. We show that the performance of a Transformer language model improves dramatically in this way, especially when the original number of epochs is large. For…
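The data requirement behind the one-epoch proposal can be illustrated with simple arithmetic. The sketch below is not from the paper: it assumes the total number of tokens processed is held fixed, and the function name and the example figures (a 10-billion-token dataset trained for 40 epochs) are hypothetical. It does not model the model-size and iteration adjustments the abstract also describes.

```python
def one_epoch_dataset_size(dataset_tokens: int, epochs: int) -> int:
    """Return the number of unique tokens needed to replace multi-epoch
    training with a single epoch at the same total token budget.

    Assumption (not from the paper): the number of tokens processed,
    dataset_tokens * epochs, is kept constant, so the only change is
    that every token is seen exactly once.
    """
    if dataset_tokens <= 0:
        raise ValueError("dataset_tokens must be positive")
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    # Same budget, no repeats: the dataset must grow by a factor of `epochs`.
    return dataset_tokens * epochs


# Hypothetical example: 10 billion tokens trained for 40 epochs.
needed = one_epoch_dataset_size(10_000_000_000, 40)
growth = needed // 10_000_000_000
print(f"unique tokens needed: {needed:,} ({growth}x the original dataset)")
```

Under this assumption the compute cost is unchanged and only the amount of collected data grows, which is the trade the abstract argues is cheap for web-scale text.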
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Byte Pair Encoding
