Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Zhuohan Li; Eric Wallace; Sheng Shen; Kevin Lin; Kurt Keutzer; Dan Klein; Joseph E. Gonzalez

arXiv:2002.11794·cs.CL·June 24, 2020·52 cites

TL;DR

Training very large Transformer models briefly can be more compute-efficient than training smaller models extensively, and these large models can be effectively compressed to outperform smaller, lightly compressed models in accuracy.

Contribution

The paper demonstrates that large Transformer models converge faster and are more robust to compression, enabling efficient training and high-accuracy compressed models.

Findings

01

Large models converge in fewer steps despite higher per-iteration cost.

02

Training large models briefly is more compute-efficient than extensive training of small models.

03

Large models retain robustness to compression, leading to better accuracy after compression.
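
The third finding concerns compression, and the page's Methods list names pruning as one technique. As a rough illustration only, and not the paper's actual procedure, the sketch below shows one common form of pruning: zeroing the smallest-magnitude weights until a target sparsity is reached. The function name `magnitude_prune`, the toy weight values, and the 50% sparsity level are all hypothetical choices made for this example.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to 0.0.

    `sparsity` is the fraction of entries to zero out (hypothetical parameter
    name; this is a generic sketch, not the paper's implementation).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Toy weights, invented purely for illustration.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.4]
```

In practice pruning is applied to whole weight tensors and is usually followed by further training to recover accuracy; the list-based version here only shows the selection rule.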

Abstract

Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam