RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Jianhao Chen; Mayi Xu; Haoyang Chen; Xiaohu Li; Xiangyu Zhang; Jianjie Huang; Zheng Wang; Xiaochun Cao; Tieyun Qian

arXiv:2508.12897 · cs.AI · January 1, 2026

TL;DR

This paper introduces RAJ-PGA, a framework combining a novel attack method and a safety alignment dataset to improve the safety of large reasoning models without compromising their reasoning abilities.

Contribution

The paper presents a new attack paradigm, RAJ, and a scalable safety alignment framework using the PGA dataset to enhance model safety against reasoning-based jailbreaks.

Findings

01. Significantly improves defense success rates by up to 29.5%

02. Effectively mitigates reasoning-based jailbreak attacks

03. Preserves and enhances reasoning capabilities
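
The "up to 29.5%" figure above does not say how the defense success rate is computed or whether the gain is absolute or relative. As a minimal sketch, assuming the rate is simply the fraction of attack attempts the model defends against (an assumption, not a definition taken from the paper), the two readings of an improvement can be told apart as follows. The function names and the sample numbers are illustrative only and are not results from the paper.

```python
from typing import Dict, List


def defense_success_rate(defended: List[bool]) -> float:
    """Fraction of attack attempts that were defended (assumed definition)."""
    if not defended:
        raise ValueError("need at least one attack attempt")
    return sum(defended) / len(defended)


def improvement(before: float, after: float) -> Dict[str, float]:
    """Report a change in rate both as percentage points and as a relative gain."""
    if before <= 0:
        raise ValueError("baseline rate must be positive for a relative gain")
    return {
        "percentage_points": (after - before) * 100,
        "relative_percent": (after - before) / before * 100,
    }


# Illustrative outcomes only: True means the attack was defended.
baseline = defense_success_rate([True, False, True, False])   # 2 of 4 defended
aligned = defense_success_rate([True, True, True, False])     # 3 of 4 defended
gain = improvement(baseline, aligned)
```

With these made-up outcomes the same change reads as 25 percentage points or as a 50% relative gain, which is why the unit behind a headline number matters.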

Abstract

Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Then, we introduce the PGA dataset,…
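
The two-stage dataset construction described in the abstract (elicit a risky reasoning trace, then rewrite it under guiding principles) can be sketched as a small data pipeline. This is a hedged outline under stated assumptions, not the authors' implementation: `elicit_trace` and `rewrite_safely` are hypothetical callables the user would supply, `AlignmentExample` is an illustrative record type, and treating "safe, constructive, and educational" as a list of principles is an assumption drawn from the abstract's wording.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AlignmentExample:
    prompt: str         # prompt used to probe the model
    risky_trace: str    # reasoning chain elicited in stage 1 (kept for auditing)
    safe_response: str  # principle-guided rewrite produced in stage 2


def build_alignment_dataset(
    prompts: Iterable[str],
    elicit_trace: Callable[[str], str],                    # hypothetical stage 1
    rewrite_safely: Callable[[str, str, List[str]], str],  # hypothetical stage 2
    principles: List[str],
) -> List[AlignmentExample]:
    """Pair each prompt with an elicited trace and its safe rewrite."""
    dataset: List[AlignmentExample] = []
    for prompt in prompts:
        trace = elicit_trace(prompt)
        if not trace.strip():
            continue  # nothing was elicited, so there is nothing to rewrite
        safe = rewrite_safely(prompt, trace, principles)
        dataset.append(AlignmentExample(prompt, trace, safe))
    return dataset


# Usage with stand-in stubs; real callables would wrap model calls.
examples = build_alignment_dataset(
    prompts=["p1", "p2"],
    elicit_trace=lambda p: "" if p == "p2" else f"trace for {p}",
    rewrite_safely=lambda p, t, pr: f"safe[{len(pr)}]: {p}",
    principles=["safe", "constructive", "educational"],
)
```

Passing both stages in as callables keeps the sketch independent of any particular model or API, which the abstract does not specify.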

Taxonomy

Topics: Fuzzy Logic and Control Systems · Software Reliability and Analysis Research · Infrastructure Maintenance and Monitoring