Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez

TL;DR
This paper shows that reinforcement learning from human feedback (RLHF) can finetune language models to act as helpful and harmless assistants while improving performance on almost all NLP evaluations, and it investigates how robust RLHF training is.
Contribution
It explores an iterated online mode of RLHF training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and reports that this efficiently improves both datasets and models.
Findings
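Alignment training improves performance on almost all NLP evaluations and is compatible with training for specialized skills such as Python coding and summarization
Roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization
Weekly online updates of preference models and RL policies with fresh human feedback data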
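The abstract describes the reward relation as roughly linear in the square root of the KL divergence between the policy and its initialization. A minimal sketch of how one might check such a relation on logged training data follows. The function name `fit_reward_vs_sqrt_kl` and the numbers are hypothetical and synthetic; they are not results or code from the paper.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Least-squares fit of reward = intercept + slope * sqrt(KL).

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, made-up data constructed to be exactly linear in sqrt(KL).
kl_values = [0.0, 1.0, 4.0, 9.0, 16.0]
rewards = [0.5 + 0.3 * math.sqrt(kl) for kl in kl_values]

slope, intercept = fit_reward_vs_sqrt_kl(kl_values, rewards)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

On real training logs the fit would only be approximate, so the residuals are worth inspecting alongside the slope.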
Abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human…
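The abstract does not spell out the preference-modeling objective. As a hedged illustration, a common formulation scores each response with a scalar and trains on pairwise comparisons, treating the score difference as the log-odds that the human-preferred response wins. The sketch below assumes that formulation; the function names are hypothetical and the scores are made up.

```python
import math


def preference_probability(score_chosen, score_rejected):
    """Probability that the chosen response is preferred, assuming the
    score difference is interpreted as log-odds (an assumption here)."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # Avoid overflow in exp() for large negative differences.
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_preference_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human preference:
    log(1 + exp(-(score_chosen - score_rejected))), computed stably."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up scores: the loss is small when the preferred response scores higher.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

Under this assumption, minimizing the loss pushes the score of the preferred response above that of the rejected one, and the resulting scalar score can then serve as the reward signal for RL.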
Code & Models
- 🤗 meta-llama/Llama-Guard-3-8B · model · 83k dl · ♡ 283
- 🤗 sileod/deberta-v3-large-tasksource-rlhf-reward-model · model · 97 dl · ♡ 11
- 🤗 stephenfitz/llm-jp-13b-instruct-full-jaster-dpo · model
- 🤗 meta-llama/Llama-Guard-3-8B-INT8 · model · 8.6k dl · ♡ 38
- 🤗 QuantFactory/Llama-Guard-3-8B-GGUF · model · 363 dl · ♡ 2
- 🤗 garak-llm/attackgeneration-toxicity_gpt2 · model · 49k dl · ♡ 4
- 🤗 Najii/Llama-Guard · model
- 🤗 Najii/Llama-Guard-3-8B-INT8 · model
- 🤗 meta-llama/Llama-Guard-3-1B · model · 63k dl · ♡ 103
- 🤗 meta-llama/Llama-Guard-3-1B-INT4 · model · 30 dl · ♡ 27
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
