The History and Risks of Reinforcement Learning and Human Feedback
Nathan Lambert, Thomas Krendl Gilbert, Tom Zick

TL;DR
This paper reviews the development, challenges, and sociotechnical aspects of reinforcement learning from human feedback (RLHF), emphasizing the need for transparency and further research into reward models used in large language models.
Contribution
It provides a comprehensive historical and conceptual analysis of RLHF, highlighting methodological tensions and proposing research directions to better understand reward models.
Findings
RLHF reward models are central but poorly understood.
There are ontological differences between costs, rewards, and preferences.
Transparency and further study are crucial for advancing RLHF understanding.
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of human preferences that acts as a reward function for optimization. This approach, which operates at the intersection of many stakeholders and academic disciplines, remains poorly understood. RLHF reward models are often cited as being central to achieving performance, yet very few descriptors of capabilities, evaluations, training methods, or open-source models exist. Given this lack of information, further study and transparency are needed for learned RLHF reward models. In this paper, we illustrate the complex history of optimizing preferences, and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we…
Taxonomy
Topics: Software Engineering Research
