Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler

TL;DR
Scaling language models up to 175 billion parameters greatly improves their few-shot performance across a wide range of NLP tasks without task-specific fine-tuning, sometimes reaching competitiveness with prior state-of-the-art fine-tuned approaches.
Contribution
This paper introduces GPT-3, a large-scale autoregressive language model that demonstrates strong few-shot learning capabilities across diverse NLP tasks without fine-tuning.
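To make "few-shot without fine-tuning" concrete: the task is specified entirely in the model's input, as a short instruction plus a handful of solved examples, followed by one unsolved query that the model is left to complete. No weights are updated. The sketch below only builds such a prompt string; the function name, the "Input"/"Output" labels, and the blank-line separator are illustrative assumptions, not the exact format used in the paper.

```python
def build_few_shot_prompt(instruction, demonstrations, query,
                          input_label="Input", output_label="Output"):
    """Join an instruction, K solved examples, and one unsolved query.

    The labels and separators are hypothetical choices for illustration.
    """
    blocks = [instruction] if instruction else []
    # Each demonstration is a (source, target) pair shown fully solved.
    for source, target in demonstrations:
        blocks.append(f"{input_label}: {source}\n{output_label}: {target}")
    # The query is left open so the model's continuation is the answer.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# K = 2 demonstrations for an English-to-French translation task.
demos = [("cheese", "fromage"), ("sea otter", "loutre de mer")]
prompt = build_few_shot_prompt("Translate English to French.", demos, "peppermint")
print(prompt)
```

Passing an empty demonstrations list yields the zero-shot variant (instruction and query only), and a single pair yields one-shot. The resulting string would be sent to any autoregressive language model as ordinary input text.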
Findings
GPT-3 achieves competitive performance on many NLP benchmarks.
GPT-3 can generate human-like news articles.
Few-shot learning improves with larger models.
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all…
Code & Models
- 🤗 facebook/opt-2.7b · model · 21k dl · ♡ 87
- 🤗 facebook/opt-125m · model · 7.0M dl · ♡ 236
- 🤗 tiiuae/falcon-40b · model · 22k dl · ♡ 2433
- 🤗 facebook/opt-350m · model · 170k dl · ♡ 149
- 🤗 facebook/opt-1.3b · model · 332k dl · ♡ 182
- 🤗 facebook/opt-6.7b · model · 29k dl · ♡ 118
- 🤗 facebook/opt-13b · model · 16k dl · ♡ 65
- 🤗 facebook/opt-30b · model · 12k dl · ♡ 136
- 🤗 facebook/opt-66b · model · 8.2k dl · ♡ 174
- 🤗 model-attribution-challenge/opt-350m · model · 14 dl
Videos
I COOKED A RECIPE MADE BY A.I. | Cooking with GPT-3 (Don't try this at home)· youtube
GPT-3: Language Models are Few-Shot Learners (Paper Explained)· youtube
OpenAI GPT-3 - Good At Almost Everything! 🤖· youtube
GPT-3 explained with examples. Possibilities, and implications.· youtube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Attention Dropout · Weight Decay · Adam · Softmax · Dense Connections
