Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang

TL;DR
This paper introduces RPO, a theoretically grounded algorithm that combines a preference optimization loss with a supervised fine-tuning (SFT) loss to mitigate overoptimization when aligning large language models with RLHF. The algorithm comes with a provable sample-efficiency guarantee and outperforms DPO empirically.
Contribution
It provides a novel theoretical framework and practical algorithm that explicitly mitigates overoptimization in RLHF by combining preference optimization with supervised learning.
Findings
RPO outperforms DPO in aligning LLMs.
The algorithm offers provable sample efficiency.
Empirical results show improved response quality.
Abstract
Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a principled manner by identifying the source of the misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model; one that simultaneously minimizes the maximum likelihood estimation of the loss and a reward penalty term. Here, the reward penalty term is introduced to prevent the policy from choosing actions with spurious high proxy rewards, resulting in provable sample efficiency of the algorithm under a partial coverage style condition. Moving from theory to practice, the proposed…
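The abstract is cut off before it states the practical objective, so the following is only a minimal sketch of the idea summarized above: a preference optimization loss combined with an SFT loss. It assumes a DPO-style pairwise term and an SFT term on the preferred response; the function name, the argument names, and the weights `beta` and `eta` are hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_plus_sft_loss(
    logp_chosen: float,        # log pi(chosen | prompt) under the policy being trained
    logp_rejected: float,      # log pi(rejected | prompt) under the policy being trained
    ref_logp_chosen: float,    # same quantities under a frozen reference policy
    ref_logp_rejected: float,
    beta: float = 0.1,         # hypothetical scale of the implicit reward
    eta: float = 0.005,        # hypothetical weight on the SFT term
) -> float:
    """Illustrative per-example loss: DPO-style term plus a weighted SFT term."""
    # Implicit reward margin between the preferred and dispreferred response.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Preference term: negative log-likelihood that the chosen response wins.
    preference_term = -log_sigmoid(margin)
    # SFT term (assumed form): negative log-likelihood of the chosen response.
    sft_term = -logp_chosen
    return preference_term + eta * sft_term


if __name__ == "__main__":
    # The policy agrees with the reference, so the margin is zero.
    print(preference_plus_sft_loss(-3.0, -3.0, -3.0, -3.0))
    # The policy raises the chosen response and lowers the rejected one.
    print(preference_plus_sft_loss(-1.0, -5.0, -3.0, -3.0))
```

Setting `eta = 0` in this sketch recovers a plain DPO-style loss. A positive `eta` adds pressure to keep probability mass on the preferred responses, which is the kind of regularizing role the title attributes to the SFT loss.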
Taxonomy
Topics: Digital Filter Design and Implementation · Numerical Methods and Algorithms · Model Reduction and Neural Networks
Methods: Direct Preference Optimization · Shrink and Fine-Tune
