The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

Yonghong Deng; Zhen Yang; Ping Jian; Xinyue Zhang; Zhongbin Guo; Chengzhi Li

arXiv:2603.08234 · cs.AI · March 10, 2026

TL;DR

This paper investigates the mechanism behind continuation-triggered jailbreaks in large language models. It attributes the vulnerability to competition at the attention-head level between the model's continuation drive and its safety defenses, and discusses the implications for improving model safety.

Contribution

The paper provides a mechanistic interpretability analysis of jailbreak behavior, identifies attention head competition as a key factor, and analyzes how safety heads function across architectures.

Findings

01. Relocating the continuation-triggered instruction suffix increases jailbreak success rates.

02. Competition between attention heads underlies the jailbreak behavior.

03. The functions of safety heads vary across model architectures.

Abstract

With the rapid advancement of large language models (LLMs), their safety has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. The root causes of these vulnerabilities are still poorly understood, which calls for rigorous investigation of jailbreak mechanisms by both the academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, in which simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between…
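
The abstract mentions activation scaling at the level of attention heads, but this excerpt does not describe the authors' actual procedure. As a rough illustration only, the toy sketch below assumes that "activation scaling" means multiplying one head's output vector by a scalar before the shared output projection, so that a scale of 0 ablates the head and a scale above 1 amplifies it. All names (`scale_heads`, `combine_heads`, the toy values) are hypothetical and not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def scale_heads(head_outputs: List[Vector], scales: List[float]) -> List[Vector]:
    """Multiply each head's output vector by its own scalar (assumed intervention)."""
    if len(head_outputs) != len(scales):
        raise ValueError("need exactly one scale per head")
    return [[s * x for x in head] for head, s in zip(head_outputs, scales)]


def combine_heads(head_outputs: List[Vector], w_o: Matrix) -> Vector:
    """Concatenate per-head outputs and apply the output projection w_o.

    w_o has one row per concatenated dimension and one column per model dimension.
    """
    concat = [x for head in head_outputs for x in head]
    if len(concat) != len(w_o):
        raise ValueError("w_o row count must match the concatenated head size")
    d_model = len(w_o[0])
    return [
        sum(concat[i] * w_o[i][j] for i in range(len(concat)))
        for j in range(d_model)
    ]


# Toy setup (illustrative values): 2 heads, head dimension 2, model dimension 2.
heads = [[1.0, 2.0], [3.0, 4.0]]
w_o = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
]

baseline = combine_heads(heads, w_o)
# Scale 0.0 removes the second head's contribution entirely.
ablated = combine_heads(scale_heads(heads, [1.0, 0.0]), w_o)
# Scale 2.0 doubles the second head's contribution.
amplified = combine_heads(scale_heads(heads, [1.0, 2.0]), w_o)

print(baseline, ablated, amplified)
```

Because the output projection is linear, scaling a head's output scales that head's contribution to the layer output by the same factor. In a setup like this one, comparing the baseline against the ablated and amplified variants is one way to probe how much a single head matters.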

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Advanced Malware Detection Techniques