Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu; Vivek V. Datla; Anoop Kumar; Zihan Guan; Sheng Li; Alfy Samuel; Daben Liu

arXiv:2602.21346·cs.CL·February 26, 2026

Open Access

TL;DR

This paper introduces Alignment-Weighted DPO, a fine-tuning method that improves large language model safety by weighting the reasoning segments of a response differently from the answer segments, improving robustness against jailbreak attacks while preserving utility.

Contribution

The paper proposes an alignment training approach that assigns different weights to the reasoning and answer segments of a response, improving the safety and robustness of LLMs against deceptive prompts.
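To make the idea of segment-level weighting concrete, here is a minimal sketch of a DPO-style loss in which reasoning tokens and answer tokens contribute with different weights. This page does not give the paper's actual objective, so everything below is an assumption for illustration: the function names, the `w_reason` and `w_answer` parameters, the per-token masking scheme, and the choice to weight per-token log-ratios are all hypothetical. With both weights set to 1.0 the sketch reduces to the standard DPO loss.

```python
import math
from typing import Sequence


def weighted_log_ratio(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    is_reasoning: Sequence[bool],
    w_reason: float,
    w_answer: float,
) -> float:
    """Sum per-token log-ratios log(pi/pi_ref), scaled by a segment weight.

    ASSUMPTION: per-token weighting by segment is an illustrative choice,
    not the formulation confirmed by the source.
    """
    total = 0.0
    for p, r, flag in zip(policy_logps, ref_logps, is_reasoning):
        weight = w_reason if flag else w_answer
        total += weight * (p - r)
    return total


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def segment_weighted_dpo_loss(
    chosen: tuple,
    rejected: tuple,
    beta: float = 0.1,
    w_reason: float = 1.0,
    w_answer: float = 1.0,
) -> float:
    """DPO-style loss on one preference pair.

    Each of `chosen` and `rejected` is a tuple
    (policy_logps, ref_logps, is_reasoning) of equal-length sequences.
    """
    chosen_ratio = weighted_log_ratio(*chosen, w_reason, w_answer)
    rejected_ratio = weighted_log_ratio(*rejected, w_reason, w_answer)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# Hypothetical toy pair: two reasoning tokens and one answer token (chosen),
# one reasoning token and one answer token (rejected).
chosen = ([-1.0, -1.0, -0.5], [-1.5, -1.2, -0.5], [True, True, False])
rejected = ([-2.0, -1.0], [-1.5, -1.0], [True, False])

plain = segment_weighted_dpo_loss(chosen, rejected)
reason_heavy = segment_weighted_dpo_loss(chosen, rejected, w_reason=2.0)
print(plain, reason_heavy)
```

In this toy pair the preference signal sits entirely in the reasoning tokens, so raising `w_reason` enlarges the margin and lowers the loss. Whether the paper weights tokens, segments, or whole log-ratios this way is not stated here.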

Findings

01. Enhanced safety robustness against jailbreak strategies.

02. Maintained high utility performance of models.

03. Outperformed standard fine-tuning baselines.

Abstract

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)