Universal Jailbreak Suffixes Are Strong Attention Hijackers
Matan Ben-Tov, Mor Geva, Mahmood Sharif

TL;DR
This paper analyzes suffix-based jailbreak attacks on large language models, showing that more universal suffixes hijack attention more strongly, and that this hijacking can be amplified to strengthen attacks or suppressed to mitigate them with minimal computational overhead.
Contribution
It uncovers the mechanism behind the effectiveness of universal suffixes and demonstrates practical methods to improve or defend against such attacks.
Findings
More universal suffixes are stronger attention hijackers.
Enhancing universality increases attack success up to 5 times.
Mitigation strategies can halve attack success with minimal utility loss.
Abstract
We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack, we observe that suffixes vary in efficacy: some are markedly more universal (generalizing to many unseen harmful instructions) than others. We first show that a shallow, critical mechanism drives GCG's effectiveness. This mechanism builds on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these…
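As a rough illustration of the information flow from the adversarial suffix to the final chat template tokens, the sketch below measures how much attention mass the template positions place on the suffix positions. It is a simplified proxy built on raw attention weights, not the paper's dominance measure; the function names, the array layout `(heads, seq, seq)`, and the toy position indices are all assumptions made for this example.

```python
import numpy as np

def suffix_attention_share(attn, suffix_positions, template_positions):
    """Mean attention mass that template tokens place on suffix tokens.

    attn: array of shape (heads, seq, seq); row q holds the attention
          weights of query position q and is assumed to sum to 1.
    suffix_positions / template_positions: lists of token indices
          (hypothetical; in practice they come from the tokenized prompt).
    """
    attn = np.asarray(attn, dtype=float)
    # Keep only the rows belonging to the chat-template query positions.
    rows = attn[:, template_positions, :]              # (heads, T, seq)
    # Sum the weight each of those rows assigns to the suffix keys.
    mass = rows[:, :, suffix_positions].sum(axis=-1)   # (heads, T)
    # Average over heads and template positions.
    return float(mass.mean())

def causal_uniform_attention(num_heads, seq_len):
    """Toy baseline: every token attends uniformly to itself and its past."""
    mask = np.tril(np.ones((seq_len, seq_len)))
    weights = mask / mask.sum(axis=-1, keepdims=True)
    return np.broadcast_to(weights, (num_heads, seq_len, seq_len)).copy()

if __name__ == "__main__":
    # Toy layout (assumed): 3 instruction tokens, 3 suffix tokens, 2 template tokens.
    attn = causal_uniform_attention(num_heads=4, seq_len=8)
    share = suffix_attention_share(attn, [3, 4, 5], [6, 7])
    print(f"suffix attention share under uniform attention: {share:.3f}")
```

Under this proxy, a suffix that hijacks contextualization would show a share well above the uniform baseline. Attention weights alone ignore value-vector magnitudes, so a more faithful measurement would also account for the size of each token's contribution.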
Taxonomy
Topics: Deception detection and forensic psychology · Digital and Cyber Forensics · Adversarial Robustness in Machine Learning
