Optimal Design for Reward Modeling in RLHF

Antoine Scheid; Etienne Boursier; Alain Durmus; Michael I. Jordan; Pierre Ménard; Eric Moulines; Michal Valko

arXiv:2410.17055·cs.LG·October 24, 2024

TL;DR

This paper formalizes the reward training process in RLHF, framing dataset selection as a simple regret minimization problem, and provides theoretical bounds and guarantees for optimal reward modeling.

Contribution

It introduces a novel offline framework for reward model training in RLHF with theoretical guarantees, addressing the cost of collecting human preferences.

Findings

01. Derived bounds on simple regret for reward models
02. Proposed an offline approach for dataset selection in RLHF
03. Established a lower bound matching the upper bound

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align an LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize the reward training model in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline…
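
The simple regret objective mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes a fixed set of arms with known feature vectors (no per-round contexts), a Bradley-Terry preference model with a linear reward, and uniformly random pair selection as a naive baseline. All names (`sample_preferences`, `fit_reward`, `simple_regret`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sample_preferences(diffs, theta, rng):
    """Draw Bradley-Terry labels: 1.0 if the first item of the pair is preferred."""
    p_first = sigmoid(diffs @ theta)
    return (rng.random(len(p_first)) < p_first).astype(float)


def fit_reward(diffs, labels, l2=1e-2, steps=500, lr=0.5):
    """Estimate the reward parameter by L2-regularized logistic regression
    on feature differences, using plain gradient descent."""
    theta = np.zeros(diffs.shape[1])
    n = len(labels)
    for _ in range(steps):
        grad = diffs.T @ (sigmoid(diffs @ theta) - labels) / n + l2 * theta
        theta -= lr * grad
    return theta


def simple_regret(features, theta_true, theta_hat):
    """True-reward gap between the best arm and the arm the estimate selects."""
    true_rewards = features @ theta_true
    chosen = np.argmax(features @ theta_hat)
    return float(true_rewards.max() - true_rewards[chosen])


# Assumed toy problem: 200 arms in dimension 5, 2000 labeled pairs.
rng = np.random.default_rng(0)
d, n_arms, n_pairs = 5, 200, 2000
features = rng.normal(size=(n_arms, d))
theta_true = rng.normal(size=d)

# Baseline dataset selection: pairs drawn uniformly at random.
pairs = rng.integers(0, n_arms, size=(n_pairs, 2))
diffs = features[pairs[:, 0]] - features[pairs[:, 1]]
labels = sample_preferences(diffs, theta_true, rng)

theta_hat = fit_reward(diffs, labels)
regret = simple_regret(features, theta_true, theta_hat)
print(f"simple regret with random pairs: {regret:.4f}")
```

Simple regret only penalizes the quality of the arm finally chosen, not the exact identification of the best arm, which is why it remains meaningful when the number of arms is large. Replacing the uniform pair sampling with a more careful selection rule is the kind of dataset design question the abstract raises.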

Datasets

misovalko/my-research-papers
dataset · 21 dl

Taxonomy

Topics: Real-time simulation and control systems · Safety Systems Engineering in Autonomy · Risk and Safety Analysis

Methods: Softmax · Attention Is All You Need · ALIGN