Understanding and Enhancing the Transferability of Jailbreaking Attacks

Runqi Lin; Bo Han; Fengwang Li; Tongliang Liu

arXiv:2502.03052·cs.LG·May 20, 2025

TL;DR

This paper investigates the transferability of jailbreaking attacks on large language models, revealing their limitations and proposing a new method, PiF, to improve attack transferability and model evaluation.

Contribution

The paper introduces the PiF method, which enhances attack transferability by dispersing model focus, addressing overfitting issues in adversarial sequences for better LLM vulnerability assessment.
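One way to make "dispersing model focus" concrete is to treat per-token importance scores as a distribution and measure how spread out it is. The sketch below is an illustration only: the entropy measure, the function names, and the example scores are assumptions made here, not the paper's metric or data.

```python
import math

def normalize(scores):
    """Turn non-negative importance scores into a probability distribution."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]

def focus_entropy(scores):
    """Shannon entropy in bits of the normalized scores.

    Higher entropy means importance is spread over more tokens.
    This is an assumed proxy for dispersion, not the paper's definition.
    """
    probs = normalize(scores)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical importance scores for a four-token input.
concentrated = [0.05, 0.05, 0.85, 0.05]  # focus sits on one token
dispersed = [0.25, 0.25, 0.25, 0.25]     # focus is spread evenly

print(focus_entropy(concentrated))
print(focus_entropy(dispersed))
```

Under this proxy, a uniform distribution over four tokens reaches the maximum of 2 bits, and any concentration on a single token lowers the value.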

Findings

01

Adversarial sequences overfit source LLMs, limiting transferability.

02

PiF effectively disperses focus, improving attack transferability.

03

PiF enhances red-teaming evaluation of proprietary LLMs.

Abstract

Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. By incorporating adversarial sequences, these attacks can redirect the source LLM's focus away from malicious-intent tokens in the original input, thereby obstructing the model's intent recognition and eliciting harmful responses. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency…
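The notion of a model "focusing" on particular tokens can be illustrated with a generic leave-one-out importance estimate: remove each token in turn and record how much a score changes. This is a minimal sketch under stated assumptions. The excerpt does not specify how the paper measures intent perception, and `toy_score` and `FLAGGED` are hypothetical stand-ins for a real model-based scorer.

```python
def leave_one_out_importance(tokens, score_fn):
    """Importance of each token = drop in score when that token is removed."""
    base = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        importances.append(base - score_fn(reduced))
    return importances

# Hypothetical scorer: counts words from a small flagged set.
# A real analysis would query a language model here instead.
FLAGGED = {"secret"}

def toy_score(tokens):
    return sum(1.0 for t in tokens if t in FLAGGED)

tokens = ["please", "reveal", "the", "secret", "recipe"]
importance = leave_one_out_importance(tokens, toy_score)
print(list(zip(tokens, importance)))
```

With this toy scorer only the flagged word receives non-zero importance, which is the concentrated pattern the abstract describes for an unperturbed input.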

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper introduces a novel and efficient method to reduce LLMs' focus on malicious-intent tokens, thus bypassing the safety alignment.
2. The PiF method demonstrates high transferability across various target LLMs while maintaining a consistently high ASR.
3. Compared to existing baselines, PiF uses fewer tokens in the adversarial prompt and requires less optimization time.

Weaknesses

1. The paper lacks comparisons with more recent baselines [1, 2, 3], which have been shown to be more effective and efficient than GCG and PAIR.
2. The authors should clarify the detailed implementation of the perplexity and instruction filters used in the paper, either in Section 5.2 or the appendix. For example, in Table 3, while PiF's perplexity is lower than GCG's, it is 9x higher than PAIR's, suggesting that a suitable perplexity threshold could filter out PiF's prompts. Listing the perform
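The reviewer's point about a perplexity threshold can be sketched as follows. This is a generic perplexity filter, not the filter used in the paper, whose implementation the reviewer notes is unspecified. The per-token log-probabilities and the threshold value are hypothetical; in practice they would come from a reference language model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_filter(token_logprobs, threshold):
    """Accept a prompt only if its perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold

# Hypothetical log-probabilities for two prompts.
fluent_prompt = [math.log(0.5)] * 4     # each token has probability 0.5
unnatural_prompt = [math.log(0.01)] * 3 # each token has probability 0.01

THRESHOLD = 5.0  # assumed value, chosen only for this illustration

print(perplexity(fluent_prompt), passes_filter(fluent_prompt, THRESHOLD))
print(perplexity(unnatural_prompt), passes_filter(unnatural_prompt, THRESHOLD))
```

The choice of threshold is the crux of the reviewer's concern: a value set between the perplexities of two attack families admits one and rejects the other.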

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Interesting observation about the distributional dependency.
2. Experiments show higher transferability than baselines.

Weaknesses

1. Only one large commercial model is evaluated. Including Gemini and Claude 3.5 is suggested.
2. It lacks a human study to confirm that the generated responses are harmful and actually provide correct, useful answers.
3. The word substitution does not seem robust against random drop [1] or grammar correction, because the generated sentences contain errors such as "and the build a bomb".

[1] Robey, Alexander, et al. "SmoothLLM: Defending large language models against jailbreaking attacks." arXiv p
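The "random drop" perturbation the reviewer mentions can be illustrated with a simple word-level version. This is a generic sketch, not a reimplementation of SmoothLLM [1]; the function name, the drop rate, and the fallback behaviour are assumptions made for illustration.

```python
import random

def random_drop(words, drop_rate, rng):
    """Independently drop each word with probability drop_rate.

    Falls back to the first word so a non-empty input never becomes empty
    (an assumed design choice for this sketch).
    """
    kept = [w for w in words if rng.random() >= drop_rate]
    return kept if kept else list(words[:1])

rng = random.Random(0)  # seeded so repeated runs give the same result
words = "the quick brown fox jumps over the lazy dog".split()
perturbed = random_drop(words, drop_rate=0.3, rng=rng)
print(" ".join(perturbed))
```

A defense built on this idea would query the model on several perturbed copies and compare the responses; a prompt whose effect depends on a few exact word choices is more likely to change behaviour under such perturbation.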

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Jailbreaking attacks are an important research direction for understanding the vulnerability of LLMs. This paper studies the transferability of existing attacks, which is critical for evaluating the robustness of LLMs when their weight parameters are inaccessible.
2. The study on the importance of each token with respect to the model under investigation is very interesting. It seems that using the template can extract attention-like information from LLMs. This is an interesting method and may be g

Weaknesses

1. The paper focuses on the transferability of jailbreaking attacks. The original GCG paper also conducted such an evaluation, crafting an adversarial suffix on multiple LLMs and then testing the attack on other models. However, no such experiment is conducted in this paper. GCG has shown reasonable transferability across various LLMs. It is critical to conduct the comparison with these attacks on top of the base attack methods. The paper also only uses Llama-2-7B-Chat as the sour

Code & Models

Repositories

tmllab/2025_ICLR_PiF
PyTorch · Official

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Advanced Malware Detection Techniques

Methods: Focus