How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

Zhuohang Long; Siyuan Wang; Shujun Liu; Yuhang Lai; Xuanjing Huang; Zhongyu Wei

arXiv:2502.14486·cs.CR·February 21, 2025

TL;DR

This paper investigates how jailbreak defenses work in large vision-language models, identifying key mechanisms and proposing ensemble strategies to improve safety without sacrificing helpfulness.

Contribution

It systematically analyzes jailbreak defenses, identifies two key defense mechanisms (safety shift and harmfulness discrimination), and develops ensemble strategies to balance safety and helpfulness in LVLMs.

Findings

01 Ensemble defenses improve the trade-off between safety and helpfulness.

02 Safety shift increases refusal rates across all queries, harmful and benign alike.

03 Harmfulness discrimination improves the model's ability to distinguish harmful from benign inputs.

Abstract

Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies, inter-mechanism ensembles and intra-mechanism ensembles, to balance safety and helpfulness.…
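The two mechanisms named in the abstract can be illustrated with a small sketch. It assumes each query already has a refusal probability in [0, 1], measured with and without a defense. The metric definitions (mean change in refusal probability for safety shift, change in a pairwise AUROC-style separation score for harmfulness discrimination), the function names, and the toy numbers are all assumptions made for illustration; this summary does not state the paper's exact measurements.

```python
from statistics import mean


def separation(harmful, benign):
    """Pairwise AUROC-style score: fraction of (harmful, benign) pairs in
    which the harmful query gets the higher refusal probability.
    Ties count as half a win. 0.5 means no separation, 1.0 is perfect."""
    wins = 0.0
    for h in harmful:
        for b in benign:
            if h > b:
                wins += 1.0
            elif h == b:
                wins += 0.5
    return wins / (len(harmful) * len(benign))


def describe_defense(base_harmful, base_benign, def_harmful, def_benign):
    """Compare refusal probabilities without (base_*) and with (def_*) a defense.

    safety_shift: mean change in refusal probability over all queries.
    discrimination_gain: change in harmful-vs-benign separation.
    Both definitions are illustrative assumptions, not the paper's metrics.
    """
    shift = mean(def_harmful + def_benign) - mean(base_harmful + base_benign)
    gain = separation(def_harmful, def_benign) - separation(base_harmful, base_benign)
    return {"safety_shift": shift, "discrimination_gain": gain}


# Toy refusal probabilities (made up for illustration only).
base_harmful = [0.6, 0.5, 0.7]
base_benign = [0.4, 0.5, 0.2]

# A defense that raises every refusal probability: refusals go up for
# harmful and benign queries alike, and separation does not change.
pure_shift = describe_defense(
    base_harmful, base_benign, [0.8, 0.7, 0.9], [0.6, 0.7, 0.4]
)

# A defense that raises refusal only on harmful queries: separation improves.
discriminating = describe_defense(
    base_harmful, base_benign, [0.9, 0.8, 0.9], [0.4, 0.5, 0.2]
)
```

Under these assumed definitions, the first toy defense shows a positive safety shift with no discrimination gain, while the second improves separation because benign refusal probabilities are left untouched. That difference is the safety versus helpfulness trade-off the abstract describes.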

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 4

Strengths

The authors tackle an important problem: characterizing how LLM jailbreak defenses work. The paper is well written and well motivated. I appreciate the authors' effort in evaluating a variety of defenses. Moreover, attributing defenses to safety shift and harmfulness discrimination is an interesting idea.

Weaknesses

1. The analysis in this paper can also be applied to text-only LLMs. Since text-only LLMs are more widely used, the authors should consider expanding the analysis. 2. The whole analysis focuses on affirmative responses to benign queries and refusals of harmful queries. However, it does not take into account the quality of the generated responses (especially since the evaluation uses a pattern-matching-based judge). Combining multiple defenses could severely harm the quality of the returned respons

Reviewer 02·Rating 6·Confidence 3

Strengths

1. Introduces novel defense mechanisms (safety shift and harmfulness discrimination) for LVLMs, providing fresh insights into model security. 2. Includes a comprehensive analysis supported by rigorous empirical validation across various datasets and models, utilizing a robust methodology by reformulating generation tasks into classification problems. 3. Evaluates two ensemble defense strategies (inter-mechanism and intra-mechanism integration), examining the balance between enhancing model safe

Weaknesses

1. The captions for the figures and tables in the paper are overly simplistic. 2. The experiments primarily rely on the MM-SafetyBench and MOSSBench datasets, which may not fully reflect the diversity and complexity of real-world scenarios. 3. While the paper proposes various defense strategies, it may not adequately discuss their feasibility and cost-effectiveness in practical deployment.

Reviewer 03·Rating 6·Confidence 3

Strengths

1. A new and interesting angle for investigating jailbreak defenses is proposed. 2. The reformulation practice is interesting and valuable, providing an effective way to investigate the mechanism of jailbreak defenses. 3. Extensive experiments across various jailbreak defenses are conducted.

Weaknesses

1. The motivation behind focusing on LVLMs is not clear. 2. Further analysis on the results (especially the ensemble part) is needed. 3. The selection of models under evaluation is not convincing.

Taxonomy

Topics·Crime Patterns and Interventions