Unifying Stable Optimization and Reference Regularization in RLHF
Li He, Qiang Qu, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

TL;DR
This paper proposes a unified regularization method for RLHF that balances preventing reward hacking and ensuring stable policy updates, leading to improved alignment and stability across benchmarks.
Contribution
It introduces a novel unified regularization approach that explicitly balances reward hacking prevention and stable optimization in RLHF, improving performance and simplicity.
Findings
Outperforms existing RLHF methods on multiple benchmarks.
Enhances alignment quality and training stability.
Simplifies implementation compared to separate regularization strategies.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: **reward hacking** and **stable optimization**. Current solutions independently address these issues through separate regularization strategies, specifically a KL-divergence penalty against a supervised fine-tuned model ($\pi_{0}$) to mitigate reward hacking, and policy ratio clipping towards the current policy ($\pi_{t}$) to promote stable alignment. However, the implicit trade-off arising from simultaneously regularizing towards both $\pi_{0}$ and $\pi_{t}$ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss…
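The two separate regularizers mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration of the standard PPO-style RLHF recipe, not the paper's unified objective; the helper names and the default values of `beta` and `clip_eps` are assumptions made for the example.

```python
import math

def kl_shaped_reward(reward, logp_sample, logp_ref, beta=0.1):
    """Reward minus a KL penalty toward the reference model pi_0.

    logp_sample - logp_ref is a single-sample estimate of the KL between
    the sampling policy and pi_0. Hypothetical helper; beta is assumed.
    """
    return reward - beta * (logp_sample - logp_ref)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate (to be maximized), averaged over samples.

    The ratio pi_theta / pi_t is clipped to [1 - clip_eps, 1 + clip_eps],
    which keeps each update close to the current policy pi_t.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

In this sketch the KL term pulls toward $\pi_{0}$ through the reward, while the clipping pulls toward $\pi_{t}$ through the surrogate, so the two constraints act through different mechanisms.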
Peer Reviews
Decision: ICLR 2026 Poster
The paper addresses a critical problem that has not been explored in the literature: the impact of regularizing towards both the reference policy and the current policy in RLHF. Regularizing towards the reference policy is done by adding a KL penalty to the reward, and regularizing towards the current policy is achieved via clipping or a KL constraint. Together, these two constrain the objective to operate in the intersection of the trust regions, which becomes increasingly restrictive as training progres
One of my concerns with the paper is the choice to regularize towards a convex combination of the reference policy $\pi_{0}$ and the current policy $\pi_{t}$, i.e., $\alpha D_{KL}(\pi \,\|\, \pi_{0}) + (1-\alpha) D_{KL}(\pi \,\|\, \pi_{t})$. This inherently favors regularizing towards one of the distributions over the other (when $\alpha \neq 0.5$). It would have been better to have two independent multipliers, one for each divergence, which supports the Lagrangian view of the o
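For intuition about the convex combination discussed in this review, a standard identity may help: for distributions with full support, $\alpha D_{KL}(\pi \,\|\, \pi_{0}) + (1-\alpha) D_{KL}(\pi \,\|\, \pi_{t}) = D_{KL}(\pi \,\|\, \pi_{\text{mix}}) - \log Z$, where $\pi_{\text{mix}} = \pi_{0}^{\alpha}\pi_{t}^{1-\alpha}/Z$ is the normalized geometric mixture. The sketch below checks this numerically on toy distributions; the values, `alpha`, and the helper names are assumptions for illustration and are not taken from the paper.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def geometric_mixture(p0, pt, alpha):
    """Normalized geometric mixture p0^alpha * pt^(1 - alpha) and its normalizer Z."""
    unnorm = [a ** alpha * b ** (1.0 - alpha) for a, b in zip(p0, pt)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

# Toy distributions over three outcomes (assumed values).
pi = [0.5, 0.3, 0.2]     # policy being optimized
pi_0 = [0.2, 0.5, 0.3]   # reference policy
pi_t = [0.6, 0.3, 0.1]   # current policy
alpha = 0.3

dual_kl = alpha * kl(pi, pi_0) + (1 - alpha) * kl(pi, pi_t)
pi_mix, z = geometric_mixture(pi_0, pi_t, alpha)
single_kl = kl(pi, pi_mix) - math.log(z)

print(f"dual-KL objective:       {dual_kl:.6f}")
print(f"KL to mixture - log(Z):  {single_kl:.6f}")
```

Because $\log Z$ does not depend on $\pi$, minimizing the weighted sum over $\pi$ is the same as minimizing a single KL toward $\pi_{\text{mix}}$, with $\alpha$ controlling where that target sits between $\pi_{0}$ and $\pi_{t}$.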
1. Clear identification of a long-standing conflict between stability and reference regularization. 2. Mathematical formulation is elegant and internally consistent. 3. DAR simplifies PPO-style RLHF into a regression-like loss that is easier to implement and more stable.
1. Outdated baseline setup. All experiments use Qwen2-7B and compare mainly against PPO, GRPO, and RLOO; no comparison to modern alignment frameworks, stronger models, and new RL methods. 2. The novelty is mostly formal: the “dual-KL” is effectively a convex interpolation between π₀ and πₜ, similar to prior multi-reference ideas. 3. Theoretical results rely on clean advantage estimation; no analysis under noisy or biased rewards. 4. Empirical gains are modest and might vanish under stronger base
- **Timely problem framing.** The paper clearly motivates the dual goals of stabilizing policy updates while constraining drift from a reference policy, and argues for a unified objective rather than separate mechanisms. - **Empirical gains.** Across three benchmarks, DAR shows strong win rates against online RLHF (e.g., PPO, GRPO, RLOO) and online DAP baselines; curves in Figure 3 and the summary in Table 2 support the claim. - **Implementation clarity intent.** The paper points to code/supplem
- In the discussion of PPO stability, the classical TRPO/PPO literature typically regularizes with a KL of the form $D_{\mathrm{KL}}\left[\pi_{\text{old}} \,\|\, \pi_{\theta}\right]$, see Schulman et al. (2015, TRPO) and Schulman et al. (2017, PPO). By contrast, the paper’s *PPO-Align* (Sec. 4.1) constrains with $D_{\mathrm{KL}}\left[\pi_{\theta} \,\|\, \pi_{t}\right]$ and penalizes $D_{\mathrm{KL}}\left[\pi_{\theta} \,\|\, \pi_{0}\right]$. The rationale for changing directions relative to the classical trust
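The direction of the KL matters here because the divergence is not symmetric. A minimal numerical sketch, using toy two-outcome distributions chosen purely for illustration (they are not from the paper):

```python
import math

def kl(p, q):
    """D_KL[p || q] for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

pi_old = [0.9, 0.1]     # toy "old" policy (assumed values)
pi_theta = [0.5, 0.5]   # toy updated policy (assumed values)

# Direction the review attributes to the classical TRPO/PPO literature.
forward = kl(pi_old, pi_theta)
# Direction with the learned policy as the first argument.
reverse = kl(pi_theta, pi_old)

print(f"D_KL[pi_old || pi_theta] = {forward:.4f}")
print(f"D_KL[pi_theta || pi_old] = {reverse:.4f}")
```

The two printed values differ, so a constraint of the same size in one direction allows a different set of policies than in the other.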
Taxonomy
Topics: Reinforcement Learning in Robotics · Recommender Systems and Techniques · Emotion and Mood Recognition
