Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper; Xander Davies; Claudia Shi; Thomas Krendl Gilbert; Jérémy Scheurer; Javier Rando; Rachel Freedman; Tomasz Korbak; David Lindner; Pedro Freire; Tony Wang; Samuel Marks; Charbel-Raphaël Segerie; Micah Carroll; Andi Peng; Phillip Christoffersen; Mehul Damani; Stewart Slocum; Usman Anwar; Anand Siththaranjan; Max Nadeau; Eric J. Michaud; Jacob Pfau; Dmitrii Krasheninnikov; Xin Chen; Lauro Langosco; Peter Hase; Erdem Bıyık; Anca Dragan; David Krueger; Dorsa Sadigh; Dylan Hadfield-Menell

arXiv:2307.15217·cs.AI·September 12, 2023·89 cites

Open Access

TL;DR

This paper surveys the open problems and fundamental limitations of reinforcement learning from human feedback (RLHF), emphasizing the need for improved oversight and complementary techniques to develop safer AI systems.

Contribution

It systematically analyzes RLHF's flaws, reviews methods to enhance it, and proposes standards for auditing and societal oversight.

Findings

01. RLHF has significant limitations that impact AI safety.
02. Current techniques to understand and improve RLHF are discussed.
03. Standards for auditing RLHF systems are proposed.

Abstract

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.

Taxonomy

Topics: Software Reliability and Analysis Research

Methods: ALIGN