Reward Model Overoptimisation in Iterated RLHF

Lorenz Wolf; Robert Kirk; Mirco Musolesi

arXiv:2505.18126·cs.LG·September 30, 2025

TL;DR

This paper investigates how reward model overoptimisation arises in iterated RLHF, analysing its dynamics and its effect on policy performance with the aim of improving stability and generalisability.

Contribution

It provides the first systematic analysis of overoptimisation in iterated RLHF, revealing how different initialisation strategies affect robustness and performance.

Findings

01

Overoptimisation decreases over iterations as reward models better approximate ground-truth preferences.

02

Performance gains diminish over successive iterations.

03

Reinitialising from the base policy is robust but limits optimisation flexibility.

Abstract

Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised.…
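The first design choice named in the abstract, how reward model training data is carried from one iteration to the next, can be illustrated with a minimal Python sketch. The strategy names (`concat`, `take_last`, `sample`) follow the terms used in the reviews further down this page. The function name, the tuple representation of a preference pair, and in particular the assumption that `sample` draws a subset the size of the latest batch from all data collected so far are hypothetical and not taken from the paper.

```python
import random


def aggregate_preferences(batches, strategy, seed=0):
    """Build the reward-model training set for the next iteration.

    batches:  list of per-iteration lists of preference pairs, oldest first.
              Each pair is a (chosen, rejected) tuple (hypothetical format).
    strategy: "concat", "take_last", or "sample".
    """
    if not batches:
        return []
    if strategy == "concat":
        # Keep every pair collected so far.
        return [pair for batch in batches for pair in batch]
    if strategy == "take_last":
        # Train only on the most recent iteration's pairs.
        return list(batches[-1])
    if strategy == "sample":
        # ASSUMPTION: draw a subset the size of the latest batch,
        # uniformly without replacement, from all pairs so far.
        pool = [pair for batch in batches for pair in batch]
        return random.Random(seed).sample(pool, len(batches[-1]))
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    history = [
        [("good-1", "bad-1")],
        [("good-2", "bad-2"), ("good-3", "bad-3")],
    ]
    for name in ("concat", "take_last", "sample"):
        print(name, len(aggregate_preferences(history, name)))
```

Under these assumptions, `concat` grows the training set every iteration, while `take_last` and `sample` keep it at the size of the latest batch, which is the trade-off the comparison between strategies turns on.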

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 2

Strengths

1. The paper analyzes the overoptimization problem in iterated RLHF very thoroughly, including an empirical study of its progression across multiple training rounds, the impact of key design choices such as data aggregation and policy initialization, and the trade-offs between robustness and optimization flexibility.
2. The paper provides a novel theoretical perspective on overoptimization.

Weaknesses

1. The paper lacks testing on standard reward benchmarks.
2. The paper is not organized clearly enough for the reader to follow the full process of iterated RLHF, its design choices, and how overoptimization is evaluated.

Reviewer 02·Rating 4·Confidence 4

Strengths

Well-scoped, decision-oriented study. The three knobs cover the practical choices teams actually debate; the recommendations are specific and replicable.

Concatenating preference data clearly helps. Strong and consistent gains vs. take-last/sample, especially in mid-KL regions where overoptimization tends to bite.

Policy resets matter. From-SFT avoids “digging the hole deeper”; recovering from an overoptimized policy is empirically hard, even with later iterations.

Distributional metric. The M

Weaknesses

Gold-RM surrogate limits external validity. A single fixed “gold” RM (and one dataset) can imprint its biases; real human-in-the-loop dynamics might differ (drift, noise, inconsistency).

Narrow task/model scope. Pythia-410M policies and 70M/160M RMs on AlpacaFarm only; conclusions might shift with stronger instruction-tuned policies, adversarial prompts, or safety domains.

Compute accounting is thin. We don’t see wall-clock/GPU hours per iteration/choice, nor inference overhead for ensembles/W

Reviewer 03·Rating 4·Confidence 5

Strengths

1. Provides a detailed study of reward over-optimization, factorizing iterated RLHF into three stages and empirically exploring actionable components in each stage.
2. Introduces metrics such as *MMD* and *KL–reward curves* to analyze over-optimization phenomena.
3. Delivers thorough experimental analyses; the conclusions are insightful and offer practical guidance for related applications.
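For readers unfamiliar with the MMD metric mentioned above, the following is a minimal sketch of a standard maximum mean discrepancy estimate between two samples. It uses the biased (V-statistic) estimator with an RBF kernel. The kernel, the bandwidth, the estimator variant, and the idea of feeding it per-response scores from two policies are assumptions made for illustration; this page does not say how the paper computes MMD.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """RBF kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y.

    x, y: arrays of shape (n, d) and (m, d); 1-D inputs are treated
    as n scalar observations.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical reward scores from a reference policy and two others.
    reference = rng.normal(0.0, 1.0, size=500)
    similar = rng.normal(0.0, 1.0, size=500)
    shifted = rng.normal(3.0, 1.0, size=500)
    print("similar:", mmd_squared(reference, similar))
    print("shifted:", mmd_squared(reference, shifted))
```

With these toy inputs the shifted sample yields a much larger value than the similar one, which is the property that makes MMD usable as a distributional distance between policies.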

Weaknesses

1. Beyond proximity to the gold reward, the paper should report test-set metrics (e.g., pairwise accuracy) for the proxy reward across iterations to provide more comparable evidence.
2. Although the gold and proxy rewards differ substantially in parameter count, reporting their performance on held-out test sets and on public benchmarks (e.g., RewardBench) may make the results clearer.
3. Conclusions drawn from a single dataset may be biased; the paper should evaluate on more datasets and base models
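The pairwise accuracy requested in the first weakness is commonly computed as the fraction of held-out preference pairs in which the reward model scores the preferred response higher. A minimal sketch follows; `reward_fn`, the pair format, and the choice to count ties as errors are assumptions for illustration, not details from the paper or the review.

```python
def pairwise_accuracy(reward_fn, pairs):
    """Fraction of (chosen, rejected) pairs ranked correctly by reward_fn.

    reward_fn: callable mapping a response to a scalar score (hypothetical).
    pairs:     non-empty list of (chosen, rejected) tuples.
    Ties count as incorrect (an assumption; other conventions exist).
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        1 for chosen, rejected in pairs if reward_fn(chosen) > reward_fn(rejected)
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy stand-in for a proxy reward model: longer responses score higher.
    toy_reward = len
    held_out = [
        ("a detailed answer", "short"),
        ("ok", "a long but worse answer"),
    ]
    print(pairwise_accuracy(toy_reward, held_out))
```

Tracking this number for each iteration's proxy reward model on a fixed held-out set would give the comparable evidence the reviewer asks for, independently of the gold reward.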
