Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Zhihan Liu; Miao Lu; Shenao Zhang; Boyi Liu; Hongyi Guo; Yingxiang Yang; Jose Blanchet; Zhaoran Wang

arXiv:2405.16436·cs.LG·December 5, 2024

TL;DR

This paper introduces RPO, a theoretically grounded algorithm that combines a preference optimization loss with a supervised fine-tuning (SFT) loss to mitigate overoptimization in RLHF when aligning large language models. The paper proves sample efficiency for the algorithm and reports empirical gains over DPO.

Contribution

The paper provides a theoretical framework and a practical algorithm that mitigate overoptimization in RLHF by combining preference optimization with supervised learning.
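The combination of a preference loss with a supervised term can be sketched for a single preference pair as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the preference term uses the standard DPO form, the SFT term is taken to be the negative log-likelihood of the chosen response, and the function names, `beta`, and `sft_weight` are hypothetical. The paper's exact objective, weighting, and normalization are not given on this page.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(logp_chosen: float, logp_rejected: float,
                    ref_logp_chosen: float, ref_logp_rejected: float,
                    beta: float) -> float:
    """Standard DPO-style loss on one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)


def sft_loss(logp_chosen: float) -> float:
    """Negative log-likelihood of the chosen response.

    Assumption: the supervised term imitates the chosen response.
    """
    return -logp_chosen


def combined_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1, sft_weight: float = 0.005) -> float:
    """Preference loss plus a weighted SFT term acting as a regularizer.

    beta and sft_weight are illustrative defaults, not values from the paper.
    """
    pref = preference_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected, beta)
    return pref + sft_weight * sft_loss(logp_chosen)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    loss = combined_loss(-10.0, -12.0, -10.5, -11.0)
    print(f"combined loss: {loss:.4f}")
```

In this sketch the SFT term keeps the likelihood of the chosen response from collapsing while the preference term widens the margin between chosen and rejected responses. Setting `sft_weight` to zero recovers the plain DPO-style loss.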

Findings

1. RPO outperforms DPO in aligning LLMs.

2. The algorithm offers provable sample efficiency.

3. Empirical results show improved response quality.

Abstract

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a principled manner by identifying the source of the misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model; one that simultaneously minimizes the maximum likelihood estimation of the loss and a reward penalty term. Here, the reward penalty term is introduced to prevent the policy from choosing actions with spurious high proxy rewards, resulting in provable sample efficiency of the algorithm under a partial coverage style condition. Moving from theory to practice, the proposed…

Code & Models

Models

ZHLiu627/zephyr-7b-gemma-rpo-avg (Hugging Face model)

Taxonomy

Topics: Digital Filter Design and Implementation · Numerical Methods and Algorithms · Model Reduction and Neural Networks

Methods: Direct Preference Optimization · Shrink and Fine-Tune