Optimal Design for Reward Modeling in RLHF
Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan, Pierre Ménard, Eric Moulines, Michal Valko

TL;DR
This paper formalizes the reward training process in RLHF, framing dataset selection as a simple regret minimization problem, and provides theoretical bounds and guarantees for optimal reward modeling.
Contribution
It introduces a novel offline framework for reward model training in RLHF with theoretical guarantees, addressing the cost of collecting human preferences.
Findings
Derived bounds on simple regret for reward models
Proposed an offline approach for dataset selection in RLHF
Established a lower bound matching the upper bound
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align an LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize the reward training model in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline…
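To make the idea of choosing which preference pairs to label more concrete, here is a minimal sketch of a generic greedy D-optimal design over pairs of candidate generations. It is an illustration under stated assumptions, not the paper's algorithm: the abstract is truncated before the method is described, so the function name `greedy_d_optimal_pairs`, the `ridge` regularizer, and the use of feature differences x_i - x_j as the design vectors for a linear pairwise-preference model are all assumptions made here.

```python
import numpy as np

def greedy_d_optimal_pairs(features, budget, ridge=1e-3):
    """Greedily pick `budget` pairs (i, j) that maximize the log-determinant
    of a regularized design matrix built from feature differences.

    features: (n_arms, d) array of hypothetical embeddings, one row per
              candidate generation (an assumption for illustration).
    Returns the chosen pairs and the inverse of the final design matrix.
    """
    n, d = features.shape
    # Enumerate unordered pairs and their difference vectors z_ij = x_i - x_j.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    diffs = np.array([features[i] - features[j] for i, j in pairs])

    # Track V^{-1}, where V = ridge * I + sum of z z^T over chosen pairs.
    V_inv = np.eye(d) / ridge
    chosen = []
    for _ in range(budget):
        # Matrix determinant lemma: adding z scales det(V) by 1 + z^T V^{-1} z,
        # so the best next pair maximizes the quadratic form z^T V^{-1} z.
        gains = np.einsum("kd,de,ke->k", diffs, V_inv, diffs)
        k = int(np.argmax(gains))
        z = diffs[k]
        chosen.append(pairs[k])  # a pair may be selected more than once
        # Sherman-Morrison rank-one update of V^{-1}.
        Vz = V_inv @ z
        V_inv -= np.outer(Vz, Vz) / (1.0 + z @ Vz)
    return chosen, V_inv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))  # 6 hypothetical generations, 3 features each
    selected, _ = greedy_d_optimal_pairs(X, budget=5)
    print(selected)
```

The greedy rule favors pairs whose difference vector points in directions the current design covers poorly, which is the usual intuition behind optimal design: spend the labeling budget where the linear reward estimate is still most uncertain.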
Taxonomy
Topics: Real-time simulation and control systems · Safety Systems Engineering in Autonomy · Risk and Safety Analysis
Methods: Softmax · Attention Is All You Need · ALIGN
