Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat

TL;DR
This paper demonstrates that instruction finetuning, especially with increased task diversity, model size, and chain-of-thought data, significantly enhances language models' performance across various tasks and benchmarks.
Contribution
It systematically explores the effects of scaling instruction finetuning in terms of tasks, model size, and data type, achieving state-of-the-art results.
Findings
Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms the original PaLM 540B by 9.4% on average.
Flan-PaLM 540B reaches a state-of-the-art 75.2% on five-shot MMLU.
Flan-T5 checkpoints show strong few-shot performance, rivaling larger models.
Abstract
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve…
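To make "datasets phrased as instructions" and the three prompting setups (zero-shot, few-shot, CoT) concrete, here is a minimal sketch of how one example might be rendered as a prompt. The template wording, the `Example` class, and the `build_prompt` and `format_target` helpers are all hypothetical illustrations; they are not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """One task instance. `rationale` holds chain-of-thought text, if any."""
    question: str
    answer: str
    rationale: Optional[str] = None


def format_target(ex: Example, use_cot: bool) -> str:
    # With CoT, the target spells out the reasoning before the final answer.
    if use_cot and ex.rationale:
        return f"{ex.rationale} So the answer is {ex.answer}."
    return ex.answer


def build_prompt(
    instruction: str,
    query: str,
    exemplars: Sequence[Example] = (),
    use_cot: bool = False,
) -> str:
    """Render an instruction prompt (hypothetical template).

    - no exemplars            -> zero-shot
    - one or more exemplars   -> few-shot
    - use_cot=True            -> exemplars include rationales; in the
                                 zero-shot case a reasoning cue is appended
    """
    parts = [instruction]
    for ex in exemplars:
        parts.append(f"Q: {ex.question}\nA: {format_target(ex, use_cot)}")
    cue = " Let's think step by step." if use_cot and not exemplars else ""
    parts.append(f"Q: {query}\nA:{cue}")
    return "\n\n".join(parts)
```

Calling `build_prompt("Answer the question.", "What is 2 + 3?")` gives a zero-shot prompt; passing `exemplars=[Example("What is 1 + 1?", "2", "1 plus 1 equals 2.")]` with `use_cot=True` gives a few-shot CoT prompt in which the exemplar answer carries its rationale.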
Code & Models
- 🤗 google/flan-t5-xxl · model · 18k dl · ♡ 1281
- 🤗 google/flan-t5-base · model · 1.2M dl · ♡ 1061
- 🤗 google/flan-t5-small · model · 567k dl · ♡ 470
- 🤗 Salesforce/blip2-flan-t5-xxl · model · 1.8k dl · ♡ 94
- 🤗 google/flan-t5-large · model · 421k dl · ♡ 877
- 🤗 google/flan-t5-xl · model · 134k dl · ♡ 530
- 🤗 ybelkada/switch-base-8-xsum · model · 5 dl · ♡ 3
- 🤗 philschmid/flan-t5-xxl-sharded-fp16 · model · 103 dl · ♡ 56
- 🤗 Salesforce/blip2-flan-t5-xl · model · 73k dl · ♡ 91
- 🤗 Salesforce/blip2-flan-t5-xl-coco · model · 542 dl · ♡ 16
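The google/flan-t5-* checkpoints listed above can typically be loaded with the Hugging Face `transformers` library. The sketch below assumes `transformers`, `torch`, and `sentencepiece` are installed and that the checkpoint can be downloaded; the `generate` helper, the `FLAN_T5_CHECKPOINTS` mapping, and the default of 32 new tokens are illustrative choices, not something this page specifies.

```python
# Hub identifiers taken from the list above, keyed by size for convenience.
FLAN_T5_CHECKPOINTS = {
    "small": "google/flan-t5-small",
    "base": "google/flan-t5-base",
    "large": "google/flan-t5-large",
    "xl": "google/flan-t5-xl",
    "xxl": "google/flan-t5-xxl",
}


def generate(prompt: str, size: str = "small", max_new_tokens: int = 32) -> str:
    """Run one instruction prompt through a Flan-T5 checkpoint (sketch)."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = FLAN_T5_CHECKPOINTS[size]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    # T5 is an encoder-decoder model: encode the prompt, then decode a reply.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the checkpoint on first use):
# print(generate("Translate English to German: How old are you?"))
```

Starting with the `small` checkpoint keeps the download and memory footprint modest; the larger variants use the same calls but may need a GPU or sharded loading.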
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · Flan-T5 · Pathways Language Model · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Adafactor · SentencePiece · Gated Linear Unit
