Gradient-based Jailbreak Images for Multimodal Fusion Models
Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, Florian Tramèr

TL;DR
This paper introduces a novel gradient-based attack method using tokenizer shortcuts to generate jailbreak images for multimodal fusion models, revealing vulnerabilities and evaluating defenses.
Contribution
It presents the first end-to-end gradient image attack leveraging tokenizer shortcuts for multimodal models, demonstrating effectiveness and efficiency over text-based attacks.
Findings
Jailbreak images elicit harmful responses in 72.5% of prompts.
Jailbreak images outperform text jailbreaks optimized with the same objective, using a 3x lower compute budget to optimize 50x more input tokens.
Representation engineering defenses transfer effectively to image attacks.
Abstract
Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that…
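To make the idea of replacing a non-differentiable tokenizer with a continuous approximation concrete, here is a minimal toy sketch. It is not the paper's shortcut, whose design is not described in this excerpt. It swaps in a generic softmax relaxation of nearest-neighbour quantization, and the codebook, `soft_quantize`, `attack`, the temperature, and the learning rate are all hypothetical choices made for illustration. Gradients are estimated with finite differences only to avoid an autodiff dependency; a real attack would backpropagate through the model.

```python
import math

# Hypothetical toy codebook: each row is one image-token embedding (2-D for readability).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_quantize(z):
    """Nearest-neighbour tokenization: piecewise constant in z, so its gradient is zero."""
    idx = min(range(len(CODEBOOK)), key=lambda k: sq_dist(z, CODEBOOK[k]))
    return CODEBOOK[idx]


def soft_quantize(z, temperature=0.5):
    """Assumed relaxation: softmax-weighted average of codebook entries, smooth in z."""
    logits = [-sq_dist(z, c) / temperature for c in CODEBOOK]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(CODEBOOK[0])
    return [sum(w / total * c[d] for w, c in zip(weights, CODEBOOK)) for d in range(dim)]


def loss(embedding, target):
    """Toy attack objective: distance between the token embedding and a target embedding."""
    return sq_dist(embedding, target)


def numerical_grad(fn, z, eps=1e-4):
    """Central finite-difference gradient of fn at z (stand-in for autodiff)."""
    grad = []
    for d in range(len(z)):
        up, down = list(z), list(z)
        up[d] += eps
        down[d] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def attack(z, target, steps=200, lr=0.05, temperature=0.5):
    """Gradient descent on the continuous input, routed through the soft tokenizer."""
    z = list(z)
    for _ in range(steps):
        g = numerical_grad(lambda v: loss(soft_quantize(v, temperature), target), z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


z0, target = [0.2, 0.1], [1.0, 1.0]
grad_hard = numerical_grad(lambda v: loss(hard_quantize(v), target), z0)
grad_soft = numerical_grad(lambda v: loss(soft_quantize(v), target), z0)
z_adv = attack(z0, target)
print("gradient through hard tokenizer:", grad_hard)
print("gradient through soft tokenizer:", grad_soft)
print("hard token after attack:", hard_quantize(z_adv))
```

In this toy setting the hard tokenizer gives the optimizer no signal, while the relaxed one lets plain gradient descent move the input until the real (hard) tokenizer emits the target token. Whether inputs optimized through a relaxation still work without it is a separate question, which the reviews below raise for the paper's own shortcuts.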
Peer Reviews
Decision: Submitted to ICLR 2025
* This is a novel method that solves the core challenge of creating gradient-based image jailbreaks for multimodal fusion models.
* Understanding the vulnerabilities in multimodal models is important for developing more robust systems, and gradient-based jailbreaking of fusion-based models has been under-explored.
* The authors use good baselines for their experiments (GCG and refusal direction attacks), and convincingly demonstrate the success of their method.
* The experiments are thoroug
* The dataset used is quite small, with only 80 prompts in the test set for direct attacks and 20 in the test set for transfer attacks. The results would be more convincing if done on a larger dataset. In addition, only a single dataset is tested.
* The paper does not include any examples of jailbroken model responses; these are helpful for qualitative understanding of the attack.
* With the exception of Table 1, the results given are all for models using the tokenizer shortcut. It would be h
- The choice of studying robustness of multimodal fusion models is timely.
- The selection of research questions is fitting for a first study in a fast-paced field. The hypothesis that it may be easier to attack models with this architecture is interesting, and is very useful to study early in the uptake of architectures.
- The paragraph writing style is easy to read, and the work can serve as an interesting log of experiments for other practitioners.
- The choice of the two shortcuts is not clearly explained in Section 3. It would be useful to spell it out.
- It would be useful to have more qualitative analysis, or at least examples of jailbreaking images vs. images that fail.
1. This paper addresses an important research topic, jailbreaking in VL-LLM models, given the rapidly growing use of VL models in real-world applications. Research in this direction seems essential.
2. This paper is well presented, making it easy to follow and understand.
1. There is a lack of comparison or discussion with other candidates to make quantization differentiable. If the proposed method achieved very strong performance in generating jailbreak images, the current approach would be acceptable. However, it seems that the proposed method can generate jailbreak images only in very limited settings: with the shortcut or in the non-transfer setting.
2. As far as I understand, the white-box attack scenario is important because, although it may be impractical and unrealistic,
