SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

Geon-Hyeong Kim; Yu Jin Kim; Byoungjip Kim; Honglak Lee; Kyunghoon Bae; Youngsoo Jang; Moontae Lee

arXiv:2505.20065·cs.LG·March 5, 2026

Open Access 3 Reviews

TL;DR

SafeDPO introduces a simple, theoretically grounded method for safety alignment in large language models that improves safety without sacrificing helpfulness, using preference data and minimal modifications.

Contribution

The paper presents SafeDPO, a novel, lightweight safety optimization approach that directly optimizes a safety-constrained objective without auxiliary models or complex pipelines.

Findings

01

SafeDPO achieves better safety-helpfulness trade-offs than existing methods.

02

It scales effectively to LLMs with up to 13 billion parameters.

03

SafeDPO's single additional hyperparameter enhances safety while maintaining theoretical optimality.

Abstract

As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the original safety alignment objective and show that, under mild assumptions, it admits a closed-form optimal policy. We further derive a provably equivalent and tractable objective, enabling direct optimization. Building on this insight, we propose SafeDPO, a lightweight method that preserves the optimal solution of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to…
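To make "one additional hyperparameter and minimal modifications" concrete, here is a minimal sketch rather than the paper's implementation. It assumes the modification is a margin subtracted from the standard DPO logit whenever the rejected response is unsafe and the chosen one is safe. The function name `safedpo_style_loss`, the parameter `delta`, and the boolean `unsafe_*` flags are hypothetical; the exact objective should be taken from the paper itself.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def safedpo_style_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    unsafe_chosen: bool,
    unsafe_rejected: bool,
    beta: float = 0.1,
    delta: float = 0.0,
) -> float:
    """Per-pair loss: a DPO logit minus an assumed safety margin.

    All log-probabilities are sequence-level values under the policy
    and the frozen reference model.
    """
    # Standard DPO logit: scaled difference of policy/reference log-ratios.
    logit = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # ASSUMPTION: the extra hyperparameter acts as a margin that is only
    # active for (safe chosen, unsafe rejected) pairs.
    margin = delta * (int(unsafe_rejected) - int(unsafe_chosen))
    return -log_sigmoid(logit - margin)
```

With `delta=0`, or when both responses are safe, this sketch reduces to the standard DPO loss. A positive `delta` raises the loss on safe-unsafe pairs until the policy separates them by more than the margin.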

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01·Rating 8·Confidence 2

Strengths

1. The derivation of a closed-form, constraint-equivalent objective (Eq. 9–12) is elegant and rigorously justified through formal propositions. The proofs (Appendix A) convincingly establish theoretical soundness, equivalence, and unbiasedness.
2. This work requires only single-stage training, whereas previous methods may need multiple stages, such as training reward and cost models or iteratively optimizing the objective.
3. The experiment results are comprehensive an

Weaknesses

1. Could the authors show some failure cases of SafeDPO, or conduct a failure analysis of its unsafe responses, to clarify where the method breaks down?
2. The figures in the paper are relatively hard to read. The authors should consider using larger dots and thicker lines.

Reviewer 02·Rating 6·Confidence 4

Strengths

+ Compared to SafeRLHF, a widely used baseline in safety alignment, SafeDPO offers greater efficiency. It significantly reduces the annotation cost for preference signals and the computational burden during training, while maintaining competitive alignment performance.
+ A fundamental distinction between SafeDPO and SafeRLHF lies in their optimization objectives: SafeDPO directly optimizes the exact objective in Eq. (6), whereas SafeRLHF relies on an approximation in Eq. (7). This direct optimiz

Weaknesses

+ **Sample Efficiency and Annotation Cost:** The paper's approach faces challenges in sample efficiency and data requirements. As indicated in Eq. (13), unsafe-unsafe pairs are discarded, which reduces the overall utilization of the available data. More critically, obtaining high-quality, informative safe-unsafe pairs for effective contrastive learning likely incurs significant additional annotation costs. For instance, it may require manually crafting safe responses to unsafe prompts.
+ **E
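The data-utilization concern can be illustrated with a small sketch. It assumes each preference pair carries binary safety labels for both responses; the `PreferencePair` record and both helper functions are hypothetical, and only the discarding of unsafe-unsafe pairs is taken from the review text. Any further preprocessing in the paper is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    # Hypothetical record layout, for illustration only.
    prompt: str
    chosen: str
    rejected: str
    chosen_unsafe: bool
    rejected_unsafe: bool


def drop_unsafe_unsafe(pairs: List[PreferencePair]) -> List[PreferencePair]:
    """Keep every pair in which at least one response is labeled safe."""
    return [p for p in pairs if not (p.chosen_unsafe and p.rejected_unsafe)]


def utilization(pairs: List[PreferencePair]) -> float:
    """Fraction of the original pairs that survive the filter."""
    if not pairs:
        return 0.0
    return len(drop_unsafe_unsafe(pairs)) / len(pairs)
```

On a dataset where a large share of pairs is unsafe-unsafe, `utilization` makes the cost of the filter explicit before any training is run.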

Reviewer 03·Rating 8·Confidence 5

Strengths

This work shows that LLM safety alignment can be achieved with a single optimization objective. A rigorous, well-structured proof supports this claim. Compared with prior approaches that require iterative training, this method needs only one training stage, which is more efficient and stable.

Weaknesses

1. **Hyperparameter guidance.** The new safety margin parameter is appealing. However, a brief tuning guide (ranges, sensitivity across model sizes) would help practitioners reproduce the safety/helpfulness trade-off.
2. **Qualitative analysis.** A few case studies would make the improvements more interpretable beyond aggregate metrics.

Taxonomy

Topics·Formal Methods in Verification

Methods·Direct Preference Optimization