Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Zehao Wang, Lanjun Wang

TL;DR
This paper introduces PRJA, an attack framework that uses semantic triggers and psychological framing to inject harmful content into the reasoning steps of large reasoning models (LRMs) while leaving their final answers unchanged, with a reported average attack success rate of 83.6%.
Contribution
The work presents a reasoning-targeted jailbreak attack that combines semantic trigger selection with psychological instruction generation, designed to embed harmful content in the reasoning steps without changing the model's final answer.
Findings
PRJA achieves an average attack success rate of 83.6% across multiple models.
The framework effectively embeds harmful content into reasoning steps without altering final answers.
Experiments indicate that the attack remains effective against commercial large reasoning models.
Abstract
Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel attack setting in which harmful content is injected into the reasoning steps while the final answers are left unchanged. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM's final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM's safety alignment mechanisms and embed harmful content into its reasoning process. To address these challenges, we propose the Psychology-based Reasoning-targeted…
