Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai; Andy Jones; Kamal Ndousse; Amanda Askell; Anna Chen; Nova DasSarma; Dawn Drain; Stanislav Fort; Deep Ganguli; Tom Henighan; Nicholas Joseph; Saurav Kadavath; Jackson Kernion; Tom Conerly; Sheer El-Showk; Nelson Elhage; Zac Hatfield-Dodds; Danny Hernandez; Tristan Hume; Scott Johnston; Shauna Kravec; Liane Lovitt; Neel Nanda; Catherine Olsson; Dario Amodei; Tom Brown; Jack Clark; Sam McCandlish; Chris Olah; Ben Mann; Jared Kaplan

arXiv:2204.05862 · cs.CL · April 13, 2022 · 362 cites

TL;DR

This paper shows that preference modeling and reinforcement learning from human feedback (RLHF) can finetune language models to act as helpful and harmless assistants. The alignment training improves performance on almost all NLP evaluations, and the paper also investigates how robust RLHF training is.

Contribution

It explores an iterated online mode of RLHF training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving both the datasets and the models.

Findings

01

Improved NLP evaluation scores across tasks

02

Roughly linear relation between RL reward and the square root of the KL divergence between the policy and its initialization

03

Effective online preference model updates

Abstract

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human…
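The robustness finding in the abstract, a roughly linear relation between RL reward and the square root of the KL divergence from the initial policy, can be illustrated with a small least-squares fit. The sketch below is not the authors' code: the function name `fit_reward_vs_sqrt_kl`, the checkpoint values, and the slope and intercept used to generate them are all made-up assumptions, chosen only to show the shape of the relation.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Ordinary least-squares fit of reward ~ slope * sqrt(KL) + intercept.

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    # Closed-form simple linear regression on (sqrt(KL), reward) pairs.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic checkpoints (assumed values, NOT results from the paper):
# KL divergence of each policy snapshot from its initialization, and a
# reward generated to be exactly linear in sqrt(KL).
kl_values = [0.0, 1.0, 4.0, 9.0, 16.0]
rewards = [0.5 + 0.3 * math.sqrt(kl) for kl in kl_values]

slope, intercept = fit_reward_vs_sqrt_kl(kl_values, rewards)
print(f"reward ~ {slope:.3f} * sqrt(KL) + {intercept:.3f}")
```

With real training runs, the same fit could be applied to measured (KL, reward) pairs from successive policy snapshots; a good linear fit in sqrt(KL), rather than in KL itself, is what the abstract describes.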

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications