What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Thomas Wang; Adam Roberts; Daniel Hesslow; Teven Le Scao; Hyung Won Chung; Iz Beltagy; Julien Launay; Colin Raffel

arXiv:2204.05832·cs.CL·April 13, 2022·23 cites

TL;DR

This study systematically compares Transformer architectures and pretraining objectives for zero-shot generalization. Causal decoder-only models trained with an autoregressive objective generalize best after pretraining alone, while models pretrained with masked language modeling and then multitask finetuned perform best overall.

Contribution

The paper provides a large-scale, systematic evaluation of architectures and pretraining objectives for zero-shot generalization in text-to-text models, and also studies how a pretrained model can be adapted from one architecture to another.

Findings

01

Causal decoder-only models trained with an autoregressive objective show the strongest zero-shot generalization after purely unsupervised pretraining.

02

Models pretrained with masked language modeling and then given multitask prompted finetuning yield the best overall results.

03

Pretrained non-causal decoder models can be adapted into causal decoder models, and pretrained causal decoder models can be adapted into non-causal ones, so one pretrained model can serve both uses.
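
To make the two pretraining objectives named in the findings concrete, here is a minimal sketch of how a single training example might be built under each. It assumes whitespace tokens and a span-corruption style of masked language modeling with sentinel tokens; the sentinel name `<extra_id_k>`, the function names, and the hand-picked spans are illustrative assumptions, not details taken from this page.

```python
def autoregressive_example(tokens):
    """Full language modeling: each token is predicted from the tokens before it."""
    inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]   # the same sequence shifted left by one
    return inputs, targets


def masked_example(tokens, spans):
    """Span-style masking (an assumed variant of masked language modeling).

    Each (start, end) half-open span is replaced by one sentinel in the
    inputs; the targets list each sentinel followed by the tokens it hides.
    `spans` must be sorted and non-overlapping.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"          # hypothetical sentinel naming
        inputs.extend(tokens[cursor:start])   # keep the visible tokens
        inputs.append(sentinel)               # stand-in for the hidden span
        targets.append(sentinel)
        targets.extend(tokens[start:end])     # the hidden tokens to predict
        cursor = end
    inputs.extend(tokens[cursor:])            # trailing visible tokens
    return inputs, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
ar_inputs, ar_targets = autoregressive_example(tokens)
mlm_inputs, mlm_targets = masked_example(tokens, [(1, 3), (6, 7)])

print("AR inputs  :", ar_inputs)
print("AR targets :", ar_targets)
print("MLM inputs :", mlm_inputs)
print("MLM targets:", mlm_targets)
```

In the autoregressive case every position contributes a prediction, whereas in the masked case only the hidden spans do. Real pipelines choose spans at random and work on subword ids rather than words.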

Abstract

Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the…
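
The difference between the causal and non-causal decoder-only architectures mentioned in the abstract comes down to which positions each token may attend to. The sketch below builds both visibility patterns as boolean masks; the function names and the `prefix_len` parameter are illustrative assumptions rather than the paper's code. An encoder-decoder model would instead use a fully visible mask over the input in its encoder and a causal mask over the target in its decoder.

```python
def visibility_mask(seq_len, prefix_len=0):
    """Return mask[i][j], True when position i may attend to position j.

    prefix_len=0 gives a causal mask (each position sees itself and the past).
    prefix_len>0 gives a non-causal mask: the first `prefix_len` positions
    (the input) are visible to every position, and the rest stay causal.
    """
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


causal = visibility_mask(5)
non_causal = visibility_mask(5, prefix_len=3)

print("causal decoder:")
print(render(causal))
print("non-causal decoder (3 input tokens):")
print(render(non_causal))
```

With `prefix_len=3`, the first three rows become fully visible over the input block, which is the only change relative to the causal mask. This is one way to see why a model could plausibly be adapted from one architecture to the other: the parameters are shared and only the mask differs.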

Code & Models

Repositories

bigscience-workshop/architecture-objective
JAX · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Residual Connection · Softmax