RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models
Jianhao Chen, Mayi Xu, Haoyang Chen, Xiaohu Li, Xiangyu Zhang, Jianjie Huang, Zheng Wang, Xiaochun Cao, Tieyun Qian

TL;DR
This paper introduces RAJ-PGA, a framework combining a novel attack method and a safety alignment dataset to improve the safety of large reasoning models without compromising their reasoning abilities.
Contribution
The paper presents a new attack paradigm, RAJ, and a scalable safety alignment framework using the PGA dataset to enhance model safety against reasoning-based jailbreaks.
Findings
Significantly improves defense success rates by up to 29.5%
Effectively mitigates reasoning-based jailbreak attacks
Preserves and enhances reasoning capabilities
Abstract
Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Then, we introduce the PGA dataset,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFuzzy Logic and Control Systems · Software Reliability and Analysis Research · Infrastructure Maintenance and Monitoring
