Self-Improving Robust Preference Optimization

Eugene Choi; Arash Ahmadian; Matthieu Geist; Olivier Pietquin; Mohammad Gheshlaghi Azar

arXiv:2406.01660·cs.LG·April 15, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SRPO, a novel offline RLHF framework that enables models to self-improve from human preferences, improving alignment and robustness across tasks without task-specific tuning.

Contribution

SRPO formulates preference learning as a task-independent min-max optimization, then recasts that problem as a scalable, non-adversarial offline loss for effective self-improvement.

Findings

01

SRPO outperforms DPO by 15% in AI Win-Rate on XSum after 5 revisions.

02

SRPO achieves 56% Win-Rate on Arena-Hard prompts, surpassing DPO and IPO.

03

The method is robust and task-independent, enabling effective self-improvement in preference optimization.
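The findings above report results "after 5 revisions", which implies an inference-time loop in which the model repeatedly refines its own completion. A minimal sketch of such a loop follows. The names `self_improve`, `generate`, and `revise` are hypothetical and do not come from the paper, and the toy stand-ins only illustrate the control flow, not a real language model.

```python
from typing import Callable, List


def self_improve(
    prompt: str,
    generate: Callable[[str], str],
    revise: Callable[[str, str], str],
    num_revisions: int = 5,
) -> List[str]:
    """Return the initial completion followed by each successive revision.

    `generate` and `revise` are hypothetical callables (assumptions):
    generate(prompt) produces a first completion, and
    revise(prompt, previous) produces an improved completion
    conditioned on the prompt and the previous attempt.
    """
    completions = [generate(prompt)]
    for _ in range(num_revisions):
        # Each revision sees the prompt and the most recent completion.
        completions.append(revise(prompt, completions[-1]))
    return completions


# Toy stand-ins so the sketch runs without a model (illustrative only).
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_revise(prompt: str, previous: str) -> str:
    return previous + "!"


history = self_improve("summarize this", toy_generate, toy_revise, num_revisions=5)
print(len(history))
print(history[-1])
```

With five revisions the returned list holds six completions: the initial draft plus one entry per revision step. Comparing the last entry against a baseline output is one way a win-rate "after 5 revisions" could be measured.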

Abstract

Online and offline RLHF methods, such as PPO and DPO, have been highly successful in aligning AI with human preferences. Despite their success, however, these methods suffer from fundamental limitations: (a) Models trained with RLHF can learn from mistakes or negative examples through RL mechanism or contrastive loss during training. However, at inference time, they lack an innate self-improvement mechanism for error corrections. (b) The optimal solution of existing methods is highly task-dependent, making it difficult for them to generalize to new tasks. To address these challenges, we propose Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework. The key idea behind SRPO is to cast the problem of learning from human preferences as a self-improvement process, mathematically formulated as a min-max objective that jointly…
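The abstract names a min-max objective but does not show it here. As a rough sketch only, a regularized min-max self-improvement objective could take the following shape. Every symbol is an assumption introduced for illustration and is not quoted from this page: $\pi$ is the generative policy, $\pi_\dagger$ a self-improvement policy that revises a completion $y_1$ into $y_2$, $\pi_{\mathrm{ref}}$ a reference policy, $p(y_2 \succ y_1 \mid x)$ the probability that $y_2$ is preferred to $y_1$ for prompt $x$, and $\beta > 0$ a regularization weight.

```latex
\[
J^*(x) \;=\; \min_{\pi}\,\max_{\pi_\dagger}\;
\mathbb{E}_{\,y_1 \sim \pi(\cdot \mid x),\; y_2 \sim \pi_\dagger(\cdot \mid y_1, x)}
\Big[\, p(y_2 \succ y_1 \mid x)
\;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\dagger \,\|\, \pi_{\mathrm{ref}} \mid y_1, x\big)
\;+\; \beta\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}} \mid x\big) \Big]
\]
```

Read this way, the inner maximization asks the self-improvement policy to produce a completion preferred over $y_1$ while staying close to the reference, and the outer minimization asks the generative policy to produce completions that leave little room for improvement.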

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This paper studies the critical problem of finding a more efficient RLHF algorithm. It could have great potential in many important real-world applications.
2. The idea of using a self-improving policy in solving the RLHF problem is novel.
3. The mathematical explanation of how to solve the min-max learning problem they propose is thorough.
4. The SRPO algorithm aims at training a policy that is robust to the quality of the dataset, making it more trustworthy than many other RLHF algorithms.

Weaknesses

1. There is no theoretical guarantee of the learning outcome. This makes the whole theoretical part weak. Is there a chance to provide any theoretical guarantee on the performance of the policy learned by SRPO under some assumptions?
2. The design of the SRPO algorithm is novel, but the description of its motivation can be improved. Currently, the motivation is that 'Instead, it is more natural to learn that given a query x and a completion y what would be the improved completion upon y'. However,

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is clearly written and easy to follow, with each step of the SRPO algorithm systematically derived. The derivations provided in the paper appear rigorous, demonstrating a well-founded approach to preference optimization. The use of a min-max objective with a focus on self-improvement mechanisms is an interesting contribution.

Weaknesses

The evaluation is currently limited to a single dataset (TL;DR Summarization) and only compares SRPO against two baselines, DPO and IPO. Conducting experiments on additional datasets would strengthen the empirical claims related to robustness. The paper lacks (empirical) comparisons to more recent preference optimization methods, such as SimPO (Meng et al., 2024) and RPO (Liu et al., 2024), which integrate recent advancements over DPO. In Figure 3, SRPO achieves improved performance with revi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The observations made by this paper are well supported empirically. In particular, they use a nice synthetic bandit example to show robustness. Further, they experimentally verify the method on the TL;DR dataset and on out-of-distribution examples from XSum. This paper further does a good job elucidating the derivation of the objective.

Weaknesses

- The proposed method empirically performs similarly to other methods under the same amount of inference compute.
- The paper does not provide much theoretical justification of the combination loss.

Taxonomy

Topics: Multi-Criteria Decision Making

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization