Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Junda Zhu; Lingyong Yan; Shuaiqiang Wang; Dawei Yin; Lei Sha

arXiv:2502.12970 · cs.CL · September 23, 2025

TL;DR

This paper introduces Reasoning-to-Defend (R2D), a safety-aware reasoning training paradigm for large language models that strengthens their defense against jailbreak attacks while preserving their original performance.

Contribution

It proposes a novel safety-aware reasoning mechanism and Contrastive Pivot Optimization to improve LLMs' safety and robustness against jailbreaks.

Findings

01 R2D significantly mitigates jailbreak attacks.

02 Models maintain original performance levels.

03 Enhanced safety perception improves robustness.

Abstract

Large Reasoning Models (LRMs) have recently demonstrated impressive performance across diverse domains. However, how the safety of Large Language Models (LLMs) benefits from enhanced reasoning capabilities against jailbreak queries remains unexplored. To bridge this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates a safety-aware reasoning mechanism into the LLM generation process. The mechanism enables self-evaluation at each step of the reasoning process, producing safety pivot tokens that indicate the safety status of the response. Furthermore, to improve the accuracy of pivot-token prediction, we propose Contrastive Pivot Optimization (CPO), which enhances the model's perception of the safety status of a given dialogue. LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their safety capabilities…
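The abstract is truncated and does not give the CPO objective, so the following is only a minimal sketch of what a contrastive objective over pivot tokens could look like, not the authors' formulation. It assumes a generic pairwise loss, -log sigmoid(log p(gold) - log p(other)), averaged over the non-gold pivot tokens. The pivot token names and the function names `log_softmax` and `contrastive_pivot_loss` are hypothetical and chosen for illustration.

```python
import math

# Hypothetical pivot vocabulary (an assumption for illustration only).
PIVOT_TOKENS = ("[SAFE]", "[UNSAFE]", "[RETHINK]")


def log_softmax(logits):
    """Numerically stable log-softmax over a dict of token -> logit."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def softplus(x):
    """Stable log(1 + exp(x)); note that -log(sigmoid(m)) == softplus(-m)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def contrastive_pivot_loss(pivot_logits, gold_pivot):
    """Assumed pairwise contrastive loss at one pivot position.

    pivot_logits: dict mapping each pivot token to the model's logit.
    gold_pivot:   the pivot token that matches the true safety status.
    Returns the mean of -log sigmoid(logp(gold) - logp(other)) over the
    non-gold pivot tokens, so a larger margin gives a smaller loss.
    """
    logp = log_softmax(pivot_logits)
    others = [tok for tok in pivot_logits if tok != gold_pivot]
    losses = [softplus(-(logp[gold_pivot] - logp[tok])) for tok in others]
    return sum(losses) / len(losses)
```

Under this assumed form, equal logits for all pivot tokens give a loss of log 2 per pair, and the loss shrinks as the gold pivot token's log-probability pulls ahead of the alternatives. In an actual training setup, a term like this would typically be added to the standard language-modeling loss; how R2D weights or combines the two is not stated in the text above.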

Code & Models

Repositories

chuhac/Reasoning-to-Defend
PyTorch · Official

Datasets

chuhac/R2D-R1
dataset · 122 downloads

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics