Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Phuc Minh Nguyen; Ngoc-Hieu Nguyen; Duy H. M. Nguyen; Anji Liu; An Mai; Binh T. Nguyen; Daniel Sonntag; Khoa D. Doan

arXiv:2506.08681 · cs.LG · June 12, 2025

TL;DR

This paper introduces IS-DAAs, an importance sampling method that reduces over-optimization in direct alignment algorithms for language models, leading to improved alignment with human preferences.

Contribution

It proposes a novel importance-sampling technique with clipping to mitigate over-optimization in offline direct alignment algorithms, enhancing their performance.

Findings

01 IS-DAAs effectively reduce over-optimization in experiments.

02 The method outperforms existing approaches under low regularization.

03 Clipping importance ratios prevents high variance issues.

Abstract

Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs. This approach, called IS-DAAs, multiplies the DAA objective by an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high-variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate…
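
The idea described in the abstract, weighting a DAA loss by a clipped importance ratio, can be illustrated with a minimal sketch for a single preference pair using DPO as the base objective. This is not the authors' implementation. The function names, the `beta` and `clip_max` parameters and their defaults, and in particular the exact form of the ratio (policy probability over reference probability, taken over both responses) are assumptions made for illustration; the abstract only states that the ratio accounts for the reference policy distribution and is clipped to a maximum value.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def clipped_ratio(logp_w, logp_l, ref_logp_w, ref_logp_l, clip_max=1.0):
    """Importance ratio for the pair, clipped to at most clip_max.

    ASSUMPTION: the ratio is pi_theta / pi_ref over both responses. The
    abstract does not spell out its exact form, and clip_max=1.0 is a
    hypothetical default, not a value taken from the paper.
    """
    log_ratio = (logp_w + logp_l) - (ref_logp_w + ref_logp_l)
    # Clip in log space so a very large log-ratio cannot overflow exp().
    return math.exp(min(log_ratio, math.log(clip_max)))


def is_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, clip_max=1.0):
    """DPO loss multiplied by the clipped importance ratio (illustrative only)."""
    w = clipped_ratio(logp_w, logp_l, ref_logp_w, ref_logp_l, clip_max)
    return w * dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
```

When the policy and the reference assign identical log-probabilities, the ratio is 1 and the weighted loss reduces to the plain DPO loss. In an autodiff framework the weight would presumably be treated as a constant so that gradients flow only through the DPO term; that is likewise an assumption here.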

Code & Models

Repositories

duyhominhnguyen/is-daas (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics