Training language models to follow instructions with human feedback

Long Ouyang; Jeff Wu; Xu Jiang; Diogo Almeida; Carroll L. Wainwright; Pamela Mishkin; Chong Zhang; Sandhini Agarwal; Katarina Slama; Alex Ray; John Schulman; Jacob Hilton; Fraser Kelton; Luke Miller; Maddie Simens; Amanda Askell; Peter Welinder; Paul Christiano; Jan Leike; Ryan Lowe

arXiv:2203.02155 · cs.CL · March 7, 2022 · 4.3k citations

TL;DR

This paper demonstrates that fine-tuning language models with human feedback makes them follow user intent more closely and improves their truthfulness and safety, even when the fine-tuned model is smaller than the baseline it is compared against.

Contribution

It introduces a method for aligning language models with human preferences using supervised fine-tuning and reinforcement learning from human feedback, resulting in the InstructGPT models.

Findings

01

InstructGPT outperforms larger GPT-3 models in human preference tests.

02

Fine-tuning with human feedback improves truthfulness and reduces toxicity.

03

Smaller models can match or exceed larger models' performance through this method.

Abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt…
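The abstract does not spell out how the collected rankings are turned into a training signal. A common approach in reinforcement learning from human feedback, assumed here rather than quoted from the text above, is to train a scalar reward model with a pairwise logistic loss over every pair of ranked outputs. The sketch below uses plain Python; the function name and the example scores are hypothetical and only illustrate the idea.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_to_worst):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs.

    `rewards_best_to_worst` holds hypothetical scalar reward-model scores
    for several outputs to one prompt, ordered from the labeler's most
    preferred output to the least preferred one.
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the output the labeler ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")

    total = 0.0
    for r_preferred, r_rejected in pairs:
        x = r_rejected - r_preferred
        # -log(sigmoid(-x)) equals softplus(x); this form avoids overflow.
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(pairs)


# Scores that agree with the labeler's ranking give a lower loss
# than scores that reverse it.
agreeing = pairwise_ranking_loss([2.0, 1.0, 0.0])
reversed_order = pairwise_ranking_loss([0.0, 1.0, 2.0])
```

Minimizing this loss pushes the reward model to score preferred outputs above rejected ones; the resulting scores can then serve as the reward signal during the reinforcement learning stage.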

Code & Models

Videos

Reinforcement Learning with Human Feedback (RLHF), Clearly Explained!!! · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Attention Dropout · Dropout · Linear Warmup With Cosine Annealing · Dense Connections · Residual Connection