Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao

TL;DR
This paper explores how attention patterns in large reasoning models (LRMs) influence jailbreak success and proposes a reinforcement learning-based method that uses attention signals to improve attack effectiveness.
Contribution
The paper introduces a novel RL-based jailbreak approach that leverages attention patterns and diverse persuasion strategies to significantly increase attack success rates on LRMs.
Findings
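Jailbreak success correlates with attention patterns: successful attacks draw lower attention to harmful tokens in the input prompt and higher attention to those tokens in the reasoning content.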
The proposed method outperforms existing approaches in effectiveness.
Diverse persuasion strategies improve attack transferability.
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward…
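The attention pattern described in the abstract can be illustrated as a simple diagnostic. The sketch below is not the paper's implementation: the function name `attention_share`, the toy attention row, the span boundaries, and the flagged positions are all hypothetical, and it assumes a single normalized attention row (one query position attending over prompt and reasoning tokens). It only shows how one might compare the fraction of attention mass landing on flagged tokens in the prompt against the fraction in the reasoning content.

```python
from typing import Iterable, Sequence


def attention_share(
    attn_row: Sequence[float],
    flagged: Iterable[int],
    span: range,
) -> float:
    """Fraction of the attention mass inside `span` that lands on flagged positions.

    attn_row: attention weights from one query position over all key positions.
    flagged:  indices of tokens marked as harmful (hypothetical labels).
    span:     the key positions to consider (prompt or reasoning segment).
    """
    total = sum(attn_row[i] for i in span)
    if total == 0:
        return 0.0
    flagged_in_span = set(flagged) & set(span)
    hit = sum(attn_row[i] for i in flagged_in_span)
    return hit / total


# Toy data (assumed, not from the paper): 3 prompt tokens, then 3 reasoning tokens.
attn_row = [0.10, 0.05, 0.15, 0.10, 0.30, 0.30]
prompt_span = range(0, 3)
reasoning_span = range(3, 6)
flagged_positions = [1, 4]  # one flagged token in each segment

prompt_share = attention_share(attn_row, flagged_positions, prompt_span)
reasoning_share = attention_share(attn_row, flagged_positions, reasoning_span)

# The abstract's pattern corresponds to a low prompt share and a high reasoning share.
print(f"prompt share: {prompt_share:.3f}, reasoning share: {reasoning_share:.3f}")
```

In practice the attention row would come from a model's attention weights, averaged over whichever heads and layers the analysis uses. The abstract is truncated before it specifies those choices, so they are left open here.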
