TL;DR
This paper investigates how reward model overoptimisation occurs in iterated RLHF, analyzing its dynamics and effects on model performance to improve stability and generalisability.
Contribution
It provides the first systematic analysis of overoptimisation in iterated RLHF, revealing how different initialisation strategies affect robustness and performance.
Findings
Overoptimisation decreases over iterations as reward models better approximate ground-truth preferences.
Performance gains diminish over successive iterations.
Reinitialising from the base policy is robust but limits optimisation flexibility.
Abstract
Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised.…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper analyzes overoptimization in iterative RLHF very thoroughly, including an empirical study of its progression across multiple training rounds, the impact of key design choices such as data aggregation and policy initialization, and the trade-offs between robustness and optimization flexibility.
2. The paper provides a novel theoretical perspective on overoptimization.
1. The paper lacks testing on standard reward benchmarks.
2. The paper is not organized clearly enough for the reader to follow the whole process of iterative RLHF design choices and overoptimization evaluation.
Well-scoped, decision-oriented study: the three knobs cover the practical choices teams actually debate, and the recommendations are specific and replicable.
Concatenating preference data clearly helps: gains over take-last and sample are strong and consistent, especially in mid-KL regions where overoptimization tends to bite.
Policy resets matter: From-SFT initialisation avoids “digging the hole deeper”; recovering from an overoptimized policy is empirically hard, even with later iterations.
Distributional metric. The M
Gold-RM surrogate limits external validity: a single fixed “gold” RM (and one dataset) can imprint its biases; real human-in-the-loop dynamics might differ (drift, noise, inconsistency).
Narrow task/model scope: Pythia-410M policies and 70M/160M RMs on AlpacaFarm only; conclusions might shift with stronger instruction-tuned policies, adversarial prompts, or safety domains.
Compute accounting is thin. We don’t see wall-clock/GPU hours per iteration/choice, nor inference overhead for ensembles/W
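The reviews above contrast three ways of carrying preference data across iterations: concatenate, take-last, and sample. The following is a minimal sketch of how such strategies might look, not the paper's implementation. The function name `aggregate_preferences`, the record format, and in particular the definition of `"sample"` (a uniform subsample of the pooled data, sized like the newest batch) are assumptions made for illustration.

```python
import random


def aggregate_preferences(per_iteration, strategy, seed=0):
    """Build reward-model training data from per-iteration preference sets.

    per_iteration[k] holds the comparisons collected in iteration k.
    All names and the exact strategy definitions here are hypothetical.
    """
    if not per_iteration:
        return []
    if strategy == "take_last":
        # Train only on the newest batch of comparisons.
        return list(per_iteration[-1])
    # Pool every comparison collected so far, oldest first.
    pooled = [pair for batch in per_iteration for pair in batch]
    if strategy == "concatenate":
        # Keep everything; the dataset grows with each iteration.
        return pooled
    if strategy == "sample":
        # Assumed definition: uniform subsample of the pool, with the
        # dataset size held equal to the newest batch.
        rng = random.Random(seed)
        return rng.sample(pooled, k=len(per_iteration[-1]))
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy usage: two iterations of (prompt, chosen, rejected) records.
batches = [
    [("p0", "a", "b")],
    [("p1", "c", "d"), ("p2", "e", "f")],
]
for name in ("take_last", "concatenate", "sample"):
    print(name, len(aggregate_preferences(batches, name)))
```

Under these assumptions, only concatenation lets the training set grow over iterations, which is one plausible reading of why the reviewer reports it helping most.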
1. Provides a detailed study of reward over-optimization, factorizing iterated RLHF into three stages and empirically exploring actionable components in each stage.
2. Introduces metrics such as *MMD* and *KL–reward curves* to analyze over-optimization phenomena.
3. Delivers thorough experimental analyses; the conclusions are insightful and offer practical guidance for related applications.
1. Beyond proximity to the gold reward, the paper should report test-set metrics (e.g., pairwise accuracy) for the proxy reward across iterations to provide more comparable evidence.
2. Although the gold and proxy rewards differ substantially in parameter count, reporting their performance on held-out test sets and on public benchmarks (e.g., RewardBench) may make the results clearer.
3. Conclusions drawn from a single dataset may be biased; the paper should evaluate on more datasets and base models
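The pairwise accuracy requested in the first point above can be sketched as follows. This is an illustrative definition, not code from the paper: `pairwise_accuracy`, the `(prompt, chosen, rejected)` record format, and the toy length-based reward are all hypothetical, and counting ties as errors is an assumption.

```python
def pairwise_accuracy(reward_fn, comparisons):
    """Fraction of held-out comparisons the reward model ranks correctly.

    reward_fn(prompt, response) returns a scalar score.
    comparisons is an iterable of (prompt, chosen, rejected) tuples.
    Ties are counted as incorrect (an assumption of this sketch).
    """
    comparisons = list(comparisons)
    if not comparisons:
        raise ValueError("need at least one comparison")
    correct = sum(
        1
        for prompt, chosen, rejected in comparisons
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(comparisons)


# Toy stand-in for a proxy reward model: longer responses score higher.
def toy_reward(prompt, response):
    return len(response)


held_out = [
    ("q", "long answer", "short"),  # ranked correctly by length
    ("q", "ok", "much longer"),     # ranked incorrectly by length
]
print(pairwise_accuracy(toy_reward, held_out))  # → 0.5
```

Evaluating each iteration's proxy reward model on the same held-out comparisons would give the per-iteration curve the reviewer asks for.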
