What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

TL;DR
This study systematically compares Transformer architectures and pretraining objectives for zero-shot generalization. Causal decoder-only models do best after pretraining alone, while models pretrained with masked language modeling and then multitask finetuned perform best overall.
Contribution
The paper provides a large-scale, systematic evaluation of how architecture and pretraining objective affect zero-shot generalization in text-to-text models, and it examines how a pretrained model can be adapted from one architecture to another.
Findings
Causal decoder-only models trained with an autoregressive objective show strong zero-shot performance after pretraining alone.
Masked language modeling followed by multitask finetuning yields the best overall results.
Pretrained non-causal models can be adapted into causal models and vice versa for improved performance.
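As a rough illustration of the two pretraining objectives named in these findings, the sketch below builds one training example for each. It assumes a span-corruption form of masked language modeling with sentinel tokens; the function names and the `<X0>`, `<X1>` sentinel strings are hypothetical and are not taken from the paper's code.

```python
def autoregressive_example(tokens):
    """Autoregressive objective: predict each token from the tokens before it."""
    # Inputs are all tokens but the last; targets are the same sequence shifted by one.
    return tokens[:-1], tokens[1:]


def masked_example(tokens, spans):
    """Masked objective (span-corruption style, an assumption of this sketch).

    `spans` is a sorted list of non-overlapping half-open (start, end) pairs.
    Each span is replaced in the input by a hypothetical sentinel token, and
    the target lists every sentinel followed by the tokens it hides.
    """
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<X{k}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must reproduce the span
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    return inputs, targets


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    print(autoregressive_example(sentence))
    print(masked_example(sentence, [(1, 2), (4, 6)]))
```

In the autoregressive case every position contributes a prediction, whereas in the masked case only the hidden spans do.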
Abstract
Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the…
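The causal and non-causal decoder-only architectures mentioned in the abstract differ mainly in which positions may attend to which. The following is a minimal sketch of the two attention masks, assuming the non-causal variant treats the first `prefix_len` tokens as a bidirectionally visible input; the function names and parameters are illustrative, not the paper's implementation.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j."""
    # Each position sees itself and everything before it.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def non_causal_mask(seq_len, prefix_len):
    """Prefix positions see each other in both directions; the rest stays causal."""
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if cell else "0" for cell in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))
    print()
    print(render(non_causal_mask(4, prefix_len=2)))
```

With `prefix_len=0` the non-causal mask reduces to the causal one, which is one way to see why a model pretrained under one mask might be adapted to the other.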
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Residual Connection · Softmax
