Scaling Instruction-Finetuned Language Models

Hyung Won Chung; Le Hou; Shayne Longpre; Barret Zoph; Yi Tay; William Fedus; Yunxuan Li; Xuezhi Wang; Mostafa Dehghani; Siddhartha Brahma; Albert Webson; Shixiang Shane Gu; Zhuyun Dai; Mirac Suzgun; Xinyun Chen; Aakanksha Chowdhery; Alex Castro-Ros; Marie Pellat; Kevin Robinson; Dasha Valter; Sharan Narang; Gaurav Mishra; Adams Yu; Vincent Zhao; Yanping Huang; Andrew Dai; Hongkun Yu; Slav Petrov; Ed H. Chi; Jeff Dean; Jacob Devlin; Adam Roberts; Denny Zhou; Quoc V. Le; Jason Wei

arXiv:2210.11416 · cs.LG · December 8, 2022 · 1.2k cites


Open Access · 5 Repos · 10 Models · 3 Datasets

TL;DR

This paper demonstrates that instruction finetuning, especially with increased task diversity, model size, and chain-of-thought data, significantly enhances language models' performance across various tasks and benchmarks.

Contribution

The paper systematically explores the effects of scaling instruction finetuning along three axes: number of tasks, model size, and data type (chain-of-thought data), and reports state-of-the-art results on several benchmarks.

Findings

01. Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms the original PaLM 540B by 9.4% on average.

02. Flan-PaLM 540B reaches a state-of-the-art 75.2% on five-shot MMLU.

03. The publicly released Flan-T5 checkpoints show strong few-shot performance, rivaling larger models.

Abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve…
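The prompting setups named in the abstract (zero-shot, few-shot, and chain-of-thought) can be illustrated with a small prompt builder. This is a minimal sketch under stated assumptions, not the templates used in the paper: the `Example` class, the `build_prompt` function, the `Q:`/`A:` layout, the "So the answer is" phrasing, and the zero-shot trigger phrase are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Example:
    """One task instance. Field names are illustrative, not from the paper."""
    question: str
    answer: str
    rationale: Optional[str] = None  # step-by-step reasoning, if available


def render_target(ex: Example, use_cot: bool) -> str:
    """Render the answer side of an exemplar, with the rationale first if CoT is on."""
    if use_cot and ex.rationale:
        return f"{ex.rationale} So the answer is {ex.answer}."
    return ex.answer


def build_prompt(
    instruction: str,
    query: str,
    exemplars: Sequence[Example] = (),
    use_cot: bool = False,
) -> str:
    """Phrase a task as an instruction prompt.

    - no exemplars           -> zero-shot
    - one or more exemplars  -> few-shot
    - use_cot=True           -> exemplars include rationales; with no exemplars,
                                an assumed trigger phrase is appended instead
    """
    blocks = [instruction]
    for ex in exemplars:
        blocks.append(f"Q: {ex.question}\nA: {render_target(ex, use_cot)}")
    # Assumed zero-shot CoT trigger; only used when there are no exemplars.
    suffix = "A: Let's think step by step." if use_cot and not exemplars else "A:"
    blocks.append(f"Q: {query}\n{suffix}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    shot = Example("What is 1 + 3?", "4", rationale="1 plus 3 equals 4.")
    print(build_prompt("Answer the question.", "What is 2 + 3?"))
    print("---")
    print(build_prompt("Answer the question.", "What is 2 + 3?", [shot], use_cot=True))
```

Keeping the exemplar list and the chain-of-thought switch as independent arguments lets one function cover all four combinations (zero-shot or few-shot, with or without CoT), which is one way to mix such formats when assembling instruction data.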




Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Flan-T5 · Pathways Language Model · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Adafactor · SentencePiece · Gated Linear Unit