Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback
Yikai Wang, Shang Liu, Jose Blanchet

TL;DR
This paper introduces Wasserstein distributionally robust regret optimization (DRRO) for reinforcement learning from human feedback (RLHF), targeting reward over-optimization and closer alignment with true human utility.
Contribution
The paper proposes a DRRO framework that minimizes worst-case regret rather than pessimizing worst-case value, together with a practical policy-gradient algorithm and theoretical insights into its advantages over standard distributionally robust optimization (DRO).
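The difference between a worst-case value objective and a worst-case regret objective can be written schematically as below. This is a generic sketch, not the paper's formulation: the symbols \pi, \theta, \hat\theta, \varepsilon, V and the ambiguity set \mathcal{U}_\varepsilon are assumed notation, and this summary does not say exactly which object the Wasserstein ball is placed around.

```latex
% Hypothetical notation (not taken from the paper):
%   \pi          : policy being optimized
%   \theta       : uncertain element (e.g. reward model or data distribution)
%   \hat\theta   : nominal (learned proxy) version of \theta
%   \mathcal{U}_\varepsilon(\hat\theta) : Wasserstein ball of radius \varepsilon around \hat\theta
%   V(\pi;\theta): expected reward of policy \pi under \theta
\begin{align*}
\text{DRO (worst-case value):}   \quad & \max_{\pi} \; \min_{\theta \in \mathcal{U}_\varepsilon(\hat\theta)} V(\pi;\theta) \\
\text{DRRO (worst-case regret):} \quad & \min_{\pi} \; \max_{\theta \in \mathcal{U}_\varepsilon(\hat\theta)}
  \Bigl[ \max_{\pi'} V(\pi';\theta) - V(\pi;\theta) \Bigr]
\end{align*}
```

In the regret form, each candidate \theta is scored relative to the best value any policy could achieve under it, so scenarios in which every policy performs poorly do not dominate the objective. This is the usual intuition for why regret-based robustness tends to be less pessimistic than value-based robustness.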
Findings
DRRO mitigates over-optimization more effectively than existing methods.
The inner worst-case regret admits an exact solution with a water-filling structure.
Experiments demonstrate DRRO's superior performance in alignment tasks.
Abstract
Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value…
