Universal Jailbreak Suffixes Are Strong Attention Hijackers

Matan Ben-Tov; Mor Geva; Mahmood Sharif

arXiv:2506.12880·cs.CR·December 23, 2025

TL;DR

This paper analyzes suffix-based jailbreak attacks on large language models, focusing on GCG. It shows that more universal suffixes hijack attention more strongly, and that this insight can be used to enhance or mitigate the attack with minimal computational overhead.

Contribution

It uncovers the mechanism behind the effectiveness of universal suffixes and demonstrates practical methods to improve or defend against such attacks.

Findings

01. More universal suffixes are stronger attention hijackers.

02. Enhancing universality increases attack success by up to 5 times.

03. Mitigation strategies can halve attack success with minimal utility loss.

Abstract

We study suffix-based jailbreaks – a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack, we observe that suffixes vary in efficacy: some are markedly more universal – generalizing to many unseen harmful instructions – than others. We first show that a shallow, critical mechanism drives GCG's effectiveness. This mechanism builds on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these…
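
The abstract describes information flowing from the adversarial suffix to the final chat template tokens, and a measure of how dominant that flow is. The sketch below is one rough way to picture the idea, not the paper's actual metric: it computes the share of attention mass that chat template positions place on suffix positions in a single toy attention head. The function names (`causal_attention`, `suffix_attention_share`), the token layout, and the logit values are all hypothetical and chosen only for illustration.

```python
import math


def causal_attention(logits):
    """Turn a square matrix of raw scores into causal attention rows.

    Row q is a softmax over keys 0..q; later keys get weight 0.
    """
    n = len(logits)
    rows = []
    for q in range(n):
        visible = logits[q][: q + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in visible]
        total = sum(exps)
        rows.append([e / total for e in exps] + [0.0] * (n - q - 1))
    return rows


def suffix_attention_share(attn, suffix_span, template_span):
    """Mean attention mass that template queries place on suffix keys.

    Spans are half-open (start, end) index ranges. This is a simplified
    proxy for "hijacking", assumed here for illustration only.
    """
    s0, s1 = suffix_span
    t0, t1 = template_span
    shares = [sum(attn[q][s0:s1]) for q in range(t0, t1)]
    return sum(shares) / len(shares)


# Hypothetical layout: [instruction x2, suffix x2, chat template x2].
N = 6
SUFFIX = (2, 4)
TEMPLATE = (4, 6)

# Baseline: every visible key gets the same raw score.
uniform_logits = [[0.0] * N for _ in range(N)]

# "Hijacking" toy case: suffix keys get a much larger raw score.
hijack_logits = [
    [3.0 if SUFFIX[0] <= k < SUFFIX[1] else 0.0 for k in range(N)]
    for _ in range(N)
]

uniform_share = suffix_attention_share(
    causal_attention(uniform_logits), SUFFIX, TEMPLATE
)
hijack_share = suffix_attention_share(
    causal_attention(hijack_logits), SUFFIX, TEMPLATE
)

print(f"uniform share: {uniform_share:.3f}")
print(f"hijack share:  {hijack_share:.3f}")
```

In a real model the attention rows would come from the model's own heads and layers rather than hand-written logits, and a faithful dominance measure would likely also account for the value vectors being mixed, not just the attention weights.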

Code & Models

Repositories

matanbt/interp-jailbreak
jax · Official

Datasets

MatanBT/gcg-evaluated-data
dataset · 112 dl

Taxonomy

Topics: Deception detection and forensic psychology · Digital and Cyber Forensics · Adversarial Robustness in Machine Learning