Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler

TL;DR
This paper investigates how scaling Transformer models affects downstream tasks, revealing that model shape and scaling protocols are crucial, and introduces more efficient models that reduce parameters and training time while maintaining performance.
Contribution
It demonstrates the importance of model shape in downstream fine-tuning, proposes improved scaling protocols, and releases pretrained models to advance research.
Findings
Model shape influences downstream fine-tuning performance.
Scaling protocols vary with compute regions.
Proposed models have 50% fewer parameters and train 40% faster than T5-base while reaching similar downstream quality.
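To make the "model shape" finding concrete, the sketch below estimates the parameter count of a single Transformer stack from its depth and width. It is a simplified illustration, not the paper's accounting: the function name `approx_params` is hypothetical, and the estimate assumes a standard block (four attention projections plus a two-matrix feed-forward layer with `d_ff = 4 * d_model`) while ignoring embeddings, biases, layer norms, and the cross-attention found in encoder-decoder models such as T5.

```python
def approx_params(n_layers, d_model, d_ff=None):
    """Rough non-embedding parameter count for one Transformer stack.

    Assumption (not from the paper): each layer has four d_model x d_model
    attention projections and a feed-forward block of two d_model x d_ff
    matrices. Biases, layer norms, and embeddings are ignored.
    """
    if d_ff is None:
        d_ff = 4 * d_model  # common default ratio, assumed here
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return n_layers * (attention + feed_forward)


# Two very different shapes with the same estimated size.
deep_narrow = approx_params(n_layers=12, d_model=512)
shallow_wide = approx_params(n_layers=3, d_model=1024)
print(deep_narrow, shallow_wide)
```

Under these assumptions both shapes come out at the same parameter count, which is why parameter count alone cannot distinguish them; the finding above is that such shapes can still differ in downstream fine-tuning quality.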
Abstract
There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has financial and/or environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear whether this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that aside from only the model size, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3)…
Code & Models
- 🤗 google/t5-efficient-base-dl2 · model · 11 downloads
- 🤗 google/t5-efficient-base-dl4 · model · 13 downloads
- 🤗 google/t5-efficient-base-dl6 · model · 16 downloads
- 🤗 google/t5-efficient-base-dl8 · model · 13 downloads · 2 likes
- 🤗 google/t5-efficient-base-dm1000 · model · 64 downloads
- 🤗 google/t5-efficient-base-dm2000 · model · 13 downloads · 1 like
- 🤗 google/t5-efficient-base-dm256 · model · 15 downloads
- 🤗 google/t5-efficient-base-dm512 · model · 14 downloads
- 🤗 google/t5-efficient-base-el16 · model · 13 downloads · 1 like
- 🤗 google/t5-efficient-base-el2 · model · 15 downloads
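The checkpoint names above appear to encode which shape dimension was changed relative to the base model. The sketch below parses that suffix. The helper `parse_checkpoint` is hypothetical, and reading `dl`, `el`, and `dm` as decoder layers, encoder layers, and model dimension is an assumption inferred from the names, not something stated on this page.

```python
import re

# Assumed meaning of each suffix (inferred from the checkpoint names).
SUFFIX_MEANING = {
    "dl": "decoder layers",
    "el": "encoder layers",
    "dm": "model dimension",
}

NAME_PATTERN = re.compile(r"google/t5-efficient-([a-z]+)-(dl|el|dm)(\d+)")


def parse_checkpoint(name):
    """Split a checkpoint id into base size, changed dimension, and value."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {name}")
    size, suffix, value = match.groups()
    return {
        "base_size": size,
        "changed": SUFFIX_MEANING[suffix],
        "value": int(value),
    }


for checkpoint in [
    "google/t5-efficient-base-dl2",
    "google/t5-efficient-base-dm1000",
    "google/t5-efficient-base-el16",
]:
    print(checkpoint, parse_checkpoint(checkpoint))
```

The same ids could be passed to a model-loading call such as `from_pretrained` in the Hugging Face `transformers` library, which requires network access and is left out of this sketch.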
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Gated Linear Unit · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Adafactor · Softmax
