Gradient-based Jailbreak Images for Multimodal Fusion Models

Javier Rando; Hannah Korevaar; Erik Brinkman; Ivan Evtimov; Florian Tramèr

arXiv:2410.03489·cs.CR·October 24, 2024

TL;DR

This paper introduces a novel gradient-based attack method using tokenizer shortcuts to generate jailbreak images for multimodal fusion models, revealing vulnerabilities and evaluating defenses.

Contribution

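It presents the first end-to-end gradient image attacks against multimodal fusion models, made possible by tokenizer shortcuts, and shows that the resulting jailbreak images outperform text jailbreaks optimized with the same objective while requiring a 3x lower compute budget.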

Findings

01

Jailbreak images elicit harmful responses in 72.5% of prompts.

02

Jailbreak images outperform text jailbreaks optimized with the same objective, at a 3x lower compute budget.

03

Representation engineering defenses transfer effectively to image attacks.

Abstract

Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that…
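The core idea in the abstract, replacing a non-differentiable tokenization step with a continuous approximation, can be illustrated with a small sketch. The code below is not the paper's construction: it assumes a generic softmax relaxation over distances to a toy codebook, and the names `hard_tokenize`, `soft_tokenize`, `soft_embedding`, and the `temperature` parameter are hypothetical. It only shows why a continuous surrogate lets small input changes produce small output changes, which is what gradient-based optimization needs.

```python
import math

def _sq_dists(z, codebook):
    # Squared Euclidean distance from vector z to every codebook entry.
    return [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]

def hard_tokenize(z, codebook):
    """Non-differentiable step: return the index of the nearest codebook entry."""
    dists = _sq_dists(z, codebook)
    return min(range(len(codebook)), key=dists.__getitem__)

def soft_tokenize(z, codebook, temperature=0.1):
    """Assumed relaxation: softmax over negative distances.

    Returns a probability distribution over token ids that varies
    smoothly with z, unlike the argmin in hard_tokenize.
    """
    logits = [-d / temperature for d in _sq_dists(z, codebook)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(z, codebook, token_embeddings, temperature=0.1):
    """Probability-weighted mix of (hypothetical) token embeddings."""
    probs = soft_tokenize(z, codebook, temperature)
    dim = len(token_embeddings[0])
    return [sum(p * emb[i] for p, emb in zip(probs, token_embeddings))
            for i in range(dim)]
```

With a low temperature the soft distribution concentrates on the token that the hard lookup would pick, so the surrogate stays close to the real tokenizer while remaining continuous in `z`. How closely a surrogate like this matches the shortcuts evaluated in the paper is not something this page specifies.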

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 8·Confidence 4

Strengths

* This is a novel method that solves the core challenge of creating gradient-based image jailbreaks for multimodal fusion models.
* Understanding the vulnerabilities in multimodal models is important for developing more robust systems, and gradient-based jailbreaking of fusion-based models has been under-explored.
* The authors use good baselines for their experiments (GCG and refusal direction attacks), and convincingly demonstrate the success of their method.
* The experiments are thoroug…

Weaknesses

* The dataset used is quite small, with only 80 prompts in the test set for direct attacks and 20 in the test set for transfer attacks. The results would be more convincing if done on a larger dataset. In addition, only a single dataset is tested.
* The paper does not include any examples of jailbroken model responses; these are helpful for qualitative understanding of the attack.
* With the exception of Table 1, the results given are all for models using the tokenizer shortcut. It would be h…

Reviewer 02·Rating 8·Confidence 3

Strengths

- The choice of studying robustness of multimodal fusion models is timely.
- The selection of research questions is fitting for a first study in a fast-paced field. The hypothesis that it may be easier to attack models with this architecture is interesting, and is very useful to study early in the uptake of architectures.
- The paragraph writing style is easy to read, and the work can serve as an interesting log of experiments for other practitioners.

Weaknesses

- The choice of the two shortcuts is not clearly explained in Section 3. It would be useful to spell it out.
- It would be useful to have more qualitative analysis, or at least examples of jailbreaking images vs. images that fail.

Reviewer 03·Rating 3·Confidence 3

Strengths

1. This paper addresses an important research topic, jailbreaking of VL-LLM models, given the significant and growing use of VL models in real-world applications. Research in this direction seems essential.
2. This paper is well presented, making it easy to follow and understand.

Weaknesses

1. There is a lack of comparison or discussion with other candidates for making quantization differentiable. If the proposed method achieved very strong performance in generating jailbreak images, the current approach would be acceptable. However, it seems that the proposed method can generate jailbreak images only in very limited settings: with the shortcut or in the non-transfer setting.
2. As far as I understand, the white-box attack scenario is important because, although it may be impractical and unrealistic,…

Code & Models

Repositories

facebookresearch/multimodal-fusion-jailbreaks
PyTorch·Official

Taxonomy

Topics: Medical Imaging and Analysis · Medical Imaging Techniques and Applications