Improving Safety Alignment via Balanced Direct Preference Optimization

Shiji Zhao; Mengyang Wang; Shukun Xiong; Fangzhou Chen; Qihui Zhu; Shouwei Ruan; Yisong Xiao; Ranjie Duan; Xun Chen; XingXing Wei

arXiv:2603.22829·cs.AI·March 25, 2026

TL;DR

This paper introduces Balanced Direct Preference Optimization (B-DPO), a method that improves safety alignment in Large Language Models (LLMs) by mitigating overfitting and the imbalanced comprehension of the responses within preference pairs.

Contribution

The paper proposes B-DPO, a DPO variant that adaptively modulates optimization strength to balance preference learning, mitigating overfitting and improving safety alignment in LLMs.

Findings

01 B-DPO improves safety capabilities over existing methods.

02 B-DPO maintains competitive general performance.

03 B-DPO addresses preference comprehension imbalance.

Abstract

With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between…
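For context, the standard DPO objective that the abstract builds on can be sketched as follows. This is a minimal illustration of plain DPO for a single preference pair, not the B-DPO method itself, whose balancing rule is not given in the text above. The function names, the argument names, and the `beta` default of 0.1 are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference model
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss shrinks as the chosen response is favored over the rejected one.
    return -log_sigmoid(chosen_reward - rejected_reward)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In this formulation the chosen and rejected responses enter the loss only through the difference of their implicit rewards, so a single margin drives the update for both responses in the pair.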

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)