The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester; Rami Al-Rfou; Noah Constant

arXiv:2104.08691·cs.CL·September 3, 2021

TL;DR

This paper demonstrates that prompt tuning with soft prompts is an effective, scalable, and resource-efficient method for adapting large frozen language models to specific tasks, outperforming few-shot GPT-3 and matching full model tuning at large scales.

Contribution

The paper introduces prompt tuning, a method that outperforms GPT-3's few-shot learning and becomes more competitive with full model tuning as model size grows, so that one frozen model can be reused across multiple tasks.

Findings

01 Prompt tuning outperforms GPT-3's few-shot learning.

02 The performance gap between prompt tuning and full model tuning closes as model size exceeds billions of parameters.

03 Soft prompts improve robustness to domain transfer compared with full model tuning.

Abstract

In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this…
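
The mechanism the abstract describes (a frozen model, plus a small set of prompt vectors learned by backpropagation from labeled examples) can be sketched in a few lines. The following is a minimal toy, not the authors' T5 implementation: the "model" is a fixed embedding table with mean pooling and a fixed linear read-out, and every name and size (EMBED, W, DATA, DIM, PROMPT_LEN, train_prompt) is a hypothetical chosen for illustration. Only the soft prompt receives gradient updates.

```python
import math
import random

random.seed(0)

DIM = 4         # embedding width (toy value, an assumption)
VOCAB = 6       # toy vocabulary size (assumption)
PROMPT_LEN = 2  # number of soft prompt vectors (assumption)

# Stand-in for the frozen language model: a fixed embedding table and a fixed
# linear read-out. Nothing below ever writes to EMBED or W.
EMBED = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]
W = [random.gauss(0.0, 1.0) for _ in range(DIM)]

# Hypothetical labeled examples: (token ids, binary label).
DATA = [([0, 1], 1), ([2, 3], 1), ([4, 5], 0), ([1, 4], 1)]


def forward(prompt, token_ids):
    """Return (P(label = 1), sequence length) for prompt + embedded tokens."""
    seq = prompt + [EMBED[t] for t in token_ids]  # prepend the soft prompt
    n = len(seq)
    pooled = [sum(vec[d] for vec in seq) / n for d in range(DIM)]
    logit = sum(w * x for w, x in zip(W, pooled))
    return 1.0 / (1.0 + math.exp(-logit)), n


def loss(prompt, data):
    """Mean binary cross-entropy over the labeled examples."""
    total = 0.0
    for token_ids, y in data:
        p, _ = forward(prompt, token_ids)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train_prompt(data, steps=200, lr=0.5):
    """Gradient descent on the soft prompt only; the model stays frozen."""
    prompt = [[0.0] * DIM for _ in range(PROMPT_LEN)]
    for _ in range(steps):
        grad = [[0.0] * DIM for _ in range(PROMPT_LEN)]
        for token_ids, y in data:
            p, n = forward(prompt, token_ids)
            # d(loss)/d(logit) = p - y, and d(logit)/d(prompt[i][d]) = W[d] / n
            for i in range(PROMPT_LEN):
                for d in range(DIM):
                    grad[i][d] += (p - y) * W[d] / n / len(data)
        for i in range(PROMPT_LEN):
            for d in range(DIM):
                prompt[i][d] -= lr * grad[i][d]
    return prompt
```

Calling train_prompt(DATA) returns PROMPT_LEN vectors of width DIM, which are the only task-specific parameters in this sketch; comparing loss(tuned, DATA) against the loss of an all-zero prompt shows the effect of tuning. Because the toy model is linear with mean pooling, the prompt can only shift the logit. In a real Transformer the prompt vectors would interact with the input through attention.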

Code & Models

Datasets

succinctly/medium-titles-and-images
dataset · 33 dl

Taxonomy

Methods: Gated Linear Unit · Linear Layer · Adafactor · Inverse Square Root Schedule · Attention Dropout · Layer Normalization · Residual Connection · Weight Decay · Multi-Head Attention