How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation
Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei

TL;DR
This paper investigates how jailbreak defenses work in large vision-language models, identifying key mechanisms and proposing ensemble strategies to improve safety without sacrificing helpfulness.
Contribution
It systematically analyzes jailbreak defenses, identifies two key mechanisms (safety shift and harmfulness discrimination), and develops ensemble strategies to balance safety and helpfulness in LVLMs.
Findings
Ensemble defenses improve safety and helpfulness trade-offs.
Safety shift increases overall refusal rates.
Harmfulness discrimination enhances harmful input detection.
Abstract
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies (inter-mechanism ensembles and intra-mechanism ensembles) to balance safety and helpfulness.…
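The binary-classification framing in the abstract can be illustrated with a small sketch. It assumes each query has already been reduced to a single refusal probability; how that probability is obtained is not specified here. The function names, the 0.5 threshold, and the toy numbers are all hypothetical, and the two summary quantities are illustrative proxies rather than the paper's own definitions.

```python
from statistics import mean


def refusal_rate(probs, threshold=0.5):
    """Fraction of queries whose refusal probability meets the threshold."""
    return sum(p >= threshold for p in probs) / len(probs)


def summarize(harmful, benign):
    """Summarize refusal behavior on harmful and benign queries."""
    return {
        "harmful_refusal_rate": refusal_rate(harmful),
        "benign_refusal_rate": refusal_rate(benign),
        # Average refusal tendency over all queries.
        "mean_refusal": mean(harmful + benign),
        # How much more the model refuses harmful than benign queries.
        "gap": mean(harmful) - mean(benign),
    }


def compare(before, after):
    """Compare a defended model against the undefended baseline.

    `before` and `after` are dicts with 'harmful' and 'benign' lists
    of refusal probabilities (a hypothetical input format).
    """
    b = summarize(before["harmful"], before["benign"])
    a = summarize(after["harmful"], after["benign"])
    return {
        # Proxy for safety shift: refusal rises across all queries.
        "safety_shift": a["mean_refusal"] - b["mean_refusal"],
        # Proxy for harmfulness discrimination: the gap widens.
        "discrimination_change": a["gap"] - b["gap"],
    }


# Toy numbers, invented purely for illustration.
no_defense = {"harmful": [0.6, 0.4], "benign": [0.2, 0.0]}
shift_like = {"harmful": [0.8, 0.6], "benign": [0.4, 0.2]}
discrim_like = {"harmful": [0.9, 0.7], "benign": [0.1, 0.0]}

shift_result = compare(no_defense, shift_like)
discrim_result = compare(no_defense, discrim_like)
```

In this toy setup, the shift-like defense raises every refusal probability by the same amount, so the harmful/benign gap is unchanged. The discrimination-like defense widens the gap instead. Separating the two quantities is one way to see why combining defenses of different kinds might trade off safety and helpfulness differently from stacking defenses of the same kind.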
Peer Reviews
Decision: Submitted to ICLR 2025
The authors tackle an important problem of characterizing how LLM jailbreak defenses work. The paper is well written and motivated. I appreciate the authors' effort in evaluating a variety of defenses. Moreover, attributing defenses to safety shift and harmfulness discrimination is an interesting idea.
1. The analysis in this paper can also be applied to text-only LLMs. Since text-only LLMs are more widely used, the authors should consider expanding the analysis. 2. The whole analysis focuses on affirmative responses to benign queries and refusals of harmful queries. However, it does not take into account the quality of the generated responses (especially since the evaluation uses a pattern-matching-based judge). Combining multiple defenses could severely harm the quality of the returned respons
1. Introduces novel defense mechanisms (safety shift and harmfulness discrimination) for LVLMs, providing fresh insights into model security. 2. Includes a comprehensive analysis supported by rigorous empirical validation across various datasets and models, utilizing a robust methodology by reformulating generation tasks into classification problems. 3. Evaluates two ensemble defense strategies (inter-mechanism and intra-mechanism integration), examining the balance between enhancing model safe
1. The captions for the figures and tables in the paper are overly simplistic. 2. The experiments primarily rely on the MM-SafetyBench and MOSSBench datasets, which may not fully reflect the diversity and complexity of real-world scenarios. 3. While the paper proposes various defense strategies, it may not adequately discuss their feasibility and cost-effectiveness in practical deployment.
1. A new and interesting angle for investigating jailbreak defenses is proposed. 2. The reformulation practice is interesting and valuable, providing an effective way to investigate the mechanism of jailbreak defenses. 3. Extensive experiments across various jailbreak defenses are conducted.
1. The motivation behind focusing on LVLMs is not clear. 2. Further analysis of the results (especially the ensemble part) is needed. 3. The selection of models under evaluation is not convincing.
Taxonomy
Topics: Crime Patterns and Interventions
