The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Nathan Lambert; Roberto Calandra

arXiv:2311.00168 · cs.LG · February 5, 2024 · 2 citations

TL;DR

This paper discusses the problem of objective mismatch in reinforcement learning from human feedback (RLHF), which causes models to overoptimize or underperform, and proposes solutions to improve alignment with user instructions.

Contribution

The paper identifies the causes of objective mismatch in RLHF and reviews related literature, proposing solutions to enhance model alignment and performance.

Findings

01. Objective mismatch leads to overoptimization and safety issues.

02. Review of literature highlights causes of misalignment.

03. Proposed solutions aim to improve RLHF effectiveness.

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said data, and optimizing a base ML model with respect to said reward for extrinsic evaluation metrics (e.g. MMLU, GSM8k). RLHF relies on many assumptions about how the various pieces fit together, such as a reward model capturing human preferences and an RL optimizer extracting the right signal from a reward model. As the RLHF process involves many distinct design decisions, it is easy to assume that multiple processes are correlated and therefore numerically linked. This apparent correlation is often not true, where reward models are easily overoptimized or RL optimizers can reduce performance on tasks not modeled in the data. Notable manifestations of…
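The three stages named in the abstract (preference data, reward model, optimization against that reward) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two-dimensional feature vectors, the linear reward model, the pairwise logistic (Bradley-Terry style) loss, and the use of best-of-n selection as a simple stand-in for an RL optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    # Hypothetical linear reward model: a dot product of weights and features.
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    # Stage 2: fit weights so each chosen response outscores its rejected one,
    # by gradient descent on the loss -log(sigmoid(r_chosen - r_rejected)).
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            g = sigmoid(-margin)  # magnitude of the loss gradient w.r.t. the margin
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

def best_of_n(w, candidates):
    # Stage 3 stand-in: pick the candidate the learned reward scores highest.
    # A real RLHF pipeline would update a policy with an RL algorithm instead.
    return max(candidates, key=lambda x: reward(w, x))

# Stage 1: toy preference data as (chosen, rejected) pairs of made-up features.
pairs = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.8, 0.7], [0.3, 0.6]),
    ([0.7, 0.1], [0.2, 0.9]),
]
w = train_reward_model(pairs, dim=2)

candidates = [[0.2, 0.5], [0.95, 0.4], [0.5, 0.9]]
picked = best_of_n(w, candidates)
print("learned weights:", w)
print("selected candidate:", picked)
```

In this sketch the selection step sees only the learned reward, never the original human judgments or any downstream metric. That gap between the proxy being optimized and the quantity actually cared about is the setting in which the abstract's objective mismatch can arise.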

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Balanced Selection