Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Zheng Lin; Zhenxing Niu; Haoxuan Ji; Yuzhe Huang; Haichang Gao

arXiv:2605.19485·cs.AI·May 20, 2026

TL;DR

This paper explores how attention patterns in large reasoning models influence jailbreak success and proposes a reinforcement learning-based method that uses attention signals to improve attack effectiveness.

Contribution

It introduces a novel RL-based jailbreak approach that leverages attention patterns and diverse strategies to significantly increase attack success rates on LRMs.

Findings

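01. Jailbreak success correlates with attention patterns: successful attacks draw less attention to harmful tokens in the input prompt and more attention to those tokens in the reasoning content.

02. The proposed method achieves higher attack success rates (ASR) than existing approaches.

03. Diverse persuasion strategies improve attack transferability.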

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward…
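The attention observation in the abstract can be made concrete with a small diagnostic. The sketch below is only an illustration of measuring how much attention mass falls on flagged token positions. It is not the authors' metric or reward, which this excerpt does not specify. The function name `attention_share`, the toy matrix, and the choice of flagged positions are all assumptions made for the example.

```python
def attention_share(attn, flagged_cols):
    """Mean fraction of each query row's attention mass on the flagged key columns.

    attn: list of rows, one per query token, each a list of non-negative
          attention weights over key positions (hypothetical input format).
    flagged_cols: key positions treated as flagged (an assumption of this sketch).
    """
    flagged = set(flagged_cols)
    shares = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            continue  # skip rows that carry no attention mass
        on_flagged = sum(w for j, w in enumerate(row) if j in flagged)
        shares.append(on_flagged / total)
    if not shares:
        raise ValueError("no rows with positive attention mass")
    return sum(shares) / len(shares)


# Toy example: 2 query tokens attending over 6 key positions.
# Positions 0-2 stand in for the input prompt, 3-5 for the reasoning content.
toy_attn = [
    [0.1, 0.1, 0.2, 0.2, 0.3, 0.1],
    [0.2, 0.1, 0.1, 0.1, 0.4, 0.1],
]
prompt_share = attention_share(toy_attn, flagged_cols=[1])     # flagged token in the prompt
reasoning_share = attention_share(toy_attn, flagged_cols=[4])  # flagged token in the reasoning
print(prompt_share, reasoning_share)
```

In this toy matrix the flagged reasoning position receives a larger share than the flagged prompt position, which is the direction the abstract associates with successful jailbreaks. A real analysis would also have to choose which layers, heads, and query tokens to aggregate over, and this page does not say how the paper does so.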
