Language Models are Few-Shot Learners

Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei

arXiv:2005.14165 · cs.CL · July 24, 2020 · 3.0k cites

TL;DR

Scaling language models up to 175 billion parameters substantially improves their few-shot performance on a wide range of NLP tasks without task-specific training, on some tasks approaching or surpassing fine-tuned approaches.

Contribution

This paper introduces GPT-3, a large-scale autoregressive language model that demonstrates strong few-shot learning capabilities across diverse NLP tasks without fine-tuning.
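To make "few-shot without fine-tuning" concrete, here is a minimal sketch of how such a prompt can be assembled: a task description, a handful of solved demonstrations, and an unsolved query are concatenated into plain text that the model is asked to continue, with no weight updates. The function name `build_few_shot_prompt`, the ` => ` separator, and the translation examples are illustrative assumptions, not an API or format taken from this page.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
    separator: str = " => ",  # hypothetical separator chosen for illustration
) -> str:
    """Concatenate an instruction, K solved examples, and one unsolved query."""
    lines = [task_description]
    for source, target in demonstrations:
        # Each demonstration shows the model one completed input/output pair.
        lines.append(f"{source}{separator}{target}")
    # The last line leaves the answer blank so the model completes it.
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)


# Illustrative data only; K = 2 demonstrations here.
prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

Passing an empty demonstration list yields the zero-shot variant of the same prompt (instruction plus query only), which is one way to compare the two settings on identical inputs.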

Findings

01 GPT-3 achieves competitive performance on many NLP benchmarks.

02 GPT-3 can generate human-like news articles.

03 Few-shot learning improves with larger models.

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all…

Code & Models

Videos

I COOKED A RECIPE MADE BY A.I. | Cooking with GPT-3 (Don't try this at home)· youtube

GPT-3: Language Models are Few-Shot Learners (Paper Explained)· youtube

OpenAI GPT-3 - Good At Almost Everything! 🤖· youtube

GPT-3 explained with examples. Possibilities, and implications.· youtube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Attention Dropout · Weight Decay · Adam · Softmax · Dense Connections