Jailbreak Transferability Emerges from Shared Representations

Rico Angell; Jannik Brinkmann; He He

arXiv:2506.12913·cs.LG·October 30, 2025

Open Access · 3 Reviews

TL;DR

This paper presents evidence that jailbreak transferability between models arises primarily from shared representations: transfer increases with the representational similarity between models and with the strength of the attack on the source model.

Contribution

The study provides causal evidence that increasing representational similarity between models enhances jailbreak transferability, emphasizing the importance of shared representations in attack generalization.

Findings

01

Transferability correlates with representational similarity.

02

Increasing similarity via distillation causally boosts transfer.

03

Natural-language attacks transfer more than cipher-based ones.

Abstract

Jailbreak transferability is the surprising phenomenon in which an adversarial attack that compromises one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign-only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability…
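As a rough illustration of what benign-only distillation could look like, the sketch below computes a temperature-scaled KL divergence between teacher and student next-token distributions. The summary above does not state the paper's actual distillation objective, so the logit-level KL loss, the `temperature` parameter, and the function names `log_softmax` and `distillation_kl` are all assumptions made for this example.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    # Hypothetical objective: mean KL(teacher || student) over token positions.
    # Assumption: the logits come from running both models on benign prompts only.
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))  # 4 token positions, toy vocabulary of 16
student = rng.normal(size=(4, 16))
loss_before = distillation_kl(teacher, student)
loss_matched = distillation_kl(teacher, teacher)
```

Minimizing a loss of this kind pulls the student's output distribution toward the teacher's on benign inputs. The loss is zero when the two distributions match and positive otherwise.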

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. This paper provides a new and important perspective on jailbreak transfer, framing it as a fundamental consequence of shared representations. 2. The authors' novel distillation experiment provides powerful causal evidence for their claims, which is a very convincing and methodologically sound approach. 3. The large-scale quantitative analysis across 20 models makes the observed correlation between similarity and transferability highly robust and credible.

Weaknesses

1. The paper's causal claim relies on the assumption that benign-only distillation solely increases representational similarity. However, this fine-tuning process could also inadvertently degrade the student model's general capabilities and original safety training effectiveness, which could also explain its increased vulnerability. 2. The study's conclusions are dependent on the chosen mutual k-nearest neighbors metric for measuring similarity. Could the authors explain why they chose this met
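For readers unfamiliar with the mutual k-nearest neighbors metric named in this review, here is a minimal sketch of one common formulation. For each prompt, it finds the k nearest neighbors in each model's representation space and averages the fraction of neighbors the two models share. The use of cosine similarity, the default `k=10`, and the function names are assumptions for illustration. The review excerpt does not specify the paper's exact configuration.

```python
import numpy as np

def knn_indices(feats, k):
    # L2-normalize rows so the dot product equals cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn(feats_a, feats_b, k=10):
    # feats_a, feats_b: (n_prompts, dim_a) and (n_prompts, dim_b) arrays,
    # with row i of each holding the two models' features for the same prompt.
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
model_a = rng.normal(size=(50, 8))   # toy stand-ins for benign-prompt features
model_b = rng.normal(size=(50, 12))  # the two dimensionalities may differ
score_same = mutual_knn(model_a, model_a, k=5)
score_diff = mutual_knn(model_a, model_b, k=5)
```

Because the score depends only on neighbor rankings within each space, it can compare models with different hidden sizes. It ranges from 0 (no shared neighbors) to 1 (identical neighborhoods).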

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The study compiles the largest known dataset of jailbreak transfer evaluations (20 models × 33 attacks × 313 prompts). 2. The authors carefully control for attack strength, preventing the common confounding that “stronger attacks transfer more.” 3. The benign-only distillation protocol provides a safe, controlled method to causally test representational effects without ethical concerns. 4. The results offer a clear practical implication for defense research: surface-level alignment is insuffi

Weaknesses

1. Limited scope of distillation: Only three teacher–student pairs, trained for one epoch under a single hyperparameter configuration. 2. Limited practical impact: Even after distillation, transfer rates in large models remain low (≤10% in Fig. 6), somewhat constraining real-world significance.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Clear expression and fluent writing. 2. Rigorous evaluation using the StrongREJECT judge. 3. Abundant experiments on open-weight models and different jailbreak attacks. 4. Validated the effectiveness of distillation techniques in enhancing the transferability of jailbreak prompts.

Weaknesses

1. Many of the findings in this paper are quite basic and not new to me, as similar conclusions have appeared in other fields. For instance, distillation has been shown to improve model similarity [1], enhance black-box transferability [2-3], and boost robustness [4-5]. Additionally, a recent paper on LLM attacks found that distillation can improve the effectiveness of prompt attacks on smaller models [6], which overlaps significantly with the core conclusion of this paper. Personally, this pape

Taxonomy

Topics: Image Processing and 3D Reconstruction · Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications