Improving Safety Alignment via Balanced Direct Preference Optimization
Shiji Zhao, Mengyang Wang, Shukun Xiong, Fangzhou Chen, Qihui Zhu, Shouwei Ruan, Yisong Xiao, Ranjie Duan, Xun Chen, XingXing Wei

TL;DR
This paper introduces Balanced Direct Preference Optimization (B-DPO), a method that improves safety alignment in Large Language Models by mitigating the overfitting associated with imbalanced preference comprehension between the two responses in a preference pair.
Contribution
The paper proposes B-DPO, an adaptive optimization method that balances preference learning, effectively mitigating overfitting and enhancing safety alignment in LLMs.
Findings
B-DPO improves safety capabilities over existing methods.
B-DPO maintains competitive general performance.
B-DPO addresses preference comprehension imbalance.
Abstract
With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between…
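For readers unfamiliar with the baseline, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not the B-DPO objective: the abstract is truncated before the adaptive modulation is described, so that part is not reproduced here. The function names, argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed token log-probabilities of each response under the policy and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Computes -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).
    """
    # Log-ratio of policy to reference for each response in the pair.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In this baseline the chosen and rejected responses enter the loss through a single shared margin, which is the term that a method balancing optimization strength between the two responses would presumably need to treat differently.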
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
