Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

Zehao Wang; Lanjun Wang

arXiv:2604.15725 · cs.LG · April 20, 2026

TL;DR

This paper introduces PRJA, an attack framework that uses semantic triggers and psychological framing to embed harmful content in the reasoning steps of large reasoning models (LRMs) while leaving their final answers unchanged. It reports an average attack success rate of 83.6% across multiple models.

Contribution

The work presents a reasoning-targeted jailbreak attack method that combines semantic trigger selection with psychological instruction generation. It addresses the difficulty of manipulating the reasoning process without changing the model's final answer.

Findings

01. PRJA achieves an average attack success rate of 83.6% across multiple models.

02. The framework embeds harmful content into the reasoning steps without altering the final answers.

03. Experiments show that the attack remains effective against commercial large reasoning models.

Abstract

Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel attack setting in which harmful content is injected into the reasoning steps while the final answer is left unchanged. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM's final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM's safety alignment mechanisms and embed harmful content into its reasoning process. To address these challenges, we propose the Psychology-based Reasoning-targeted…
