TL;DR
This paper introduces ArrAttack, a novel method for generating robust jailbreak prompts that can bypass defenses in large language models, demonstrating high transferability and effectiveness across various models and defenses.
Contribution
The paper presents ArrAttack, a universal attack method that automatically creates robust jailbreak prompts capable of bypassing defended LLMs, with a universal robustness judgment model for evaluation.
Findings
ArrAttack outperforms existing attack strategies in effectiveness.
ArrAttack demonstrates strong transferability across different models.
The approach effectively bypasses various defense mechanisms.
Abstract
Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop…
Peer Reviews
Decision·ICLR 2025 Poster
1. This paper notices the limitations of existing jailbreak attacks on defended LLMs and presents a unique approach to jailbreak prompt generation by combining a robustness judgment model with a rewriting-based generation technique. 2. The paper is well-structured, with a clear problem formulation and a detailed description of the proposed methods. 3. The experiments are comprehensive, evaluating multiple models with diverse architectures and defense mechanisms. The authors employ various eval
1. The authors focus on jailbreak attacks in defense scenarios. However, this has not been thoroughly validated across all defense types, which may limit generalizability. The defense methods tested in the paper are all system-level, focusing on input-level defenses. The authors could strengthen the study by including model-level defense mechanisms, such as unlearning [1] and adversarial fine-tuning. 2. The authors should provide more detailed case studies for the robustness judgment model and
1. The paper presents a new approach by integrating a robustness judgment model with the generation of jailbreak prompts, enhancing both the efficiency and robustness of attacks. However, similar concepts have already been proposed, such as PAIR[1] and TAP[2], which utilize LLMs to iteratively rewrite adversarial prompts to achieve a high ASR. 2. The paper is generally well-structured, providing clear explanations of the proposed method, its components, and the evaluation criteria. [1] Jailbrea
1. While ArrAttack improves robustness, the core attack methodology—particularly the rewriting-based approach—is not entirely new, as similar strategies have been explored in previous works like PAIR and TAP. 2. The focus on defense-enhanced LLMs is limited by incomplete experimentation. Several key model-level defenses [3], such as safety training and unlearning, have not been thoroughly examined. 3. The paper lacks comparison with similar baselines, such as PAIR and TAP, making it difficult to
1. The soundness of this paper is good. 2. In Table 1, the results indicate that ArrAttack not only achieves a high ASR but also obtains a PPL score. 3. The transferability of the jailbreak prompts generated by ArrAttack is good.
1. I think the writing of this paper is not satisfactory, especially the method and experimental sections. To be honest, ArrAttack itself is not hard to understand, but I had to work hard to follow the authors' chain of thought. In Section 3, the authors should introduce how ArrAttack is motivated and how its pipeline works rather than the detailed hyperparameter settings. In Section 4, instead of combining all results into a huge subsection "RESULTS", you should d
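Note on metrics: the reviews above refer to ASR (attack success rate) and PPL (perplexity) without defining them. The sketch below shows the standard textbook definitions only. It assumes per-attempt success labels and per-token natural-log probabilities are already available. The function names are hypothetical, and the paper's actual judge and scoring model are not described in this summary.

```python
import math
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Fraction of attempts labeled successful.

    Assumption: each label comes from some external judge; how the
    paper assigns these labels is not stated in this summary.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Under these definitions, a higher ASR means more attempts were judged successful, and a lower PPL means the scoring language model finds the prompt text more fluent.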
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Byte Pair Encoding
