Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer; Dan Valentine; Luke Bailey; James Chua; Cristóbal Eyzaguirre; Zane Durante; Joe Benton; Brando Miranda; Henry Sleight; John Hughes; Rajashree Agrawal; Mrinank Sharma; Scott Emmons; Sanmi Koyejo; Ethan Perez

arXiv:2407.15211·cs.CL·December 17, 2024

Open Access

TL;DR

This study empirically investigates whether gradient-based image jailbreaks transfer across a diverse set of vision-language models (VLMs). It finds that transfer is extremely limited: jailbreaks succeed against the VLM(s) they were optimized on but rarely against other VLMs.

Contribution

The paper provides the first large-scale empirical analysis of the transferability of image jailbreaks in VLMs, showing that such jailbreaks transfer poorly and that VLMs are robust to transfer attacks.

Findings

01. Transferable gradient-based image jailbreaks are very rare.

02. Transfer success is mainly between identical or very similar VLMs.

03. VLMs show greater robustness to transfer attacks compared to language models and image classifiers.

Abstract

The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether…
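The optimization setup the abstract describes (one image tuned by gradient descent against a single model or an ensemble, then evaluated on a model it was not tuned on) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's code: small random linear-softmax classifiers stand in for VLMs, cross-entropy toward a target class stands in for the loss on target text, and the names `make_toy_model`, `loss_and_grad`, `ensemble_loss`, and `optimize_image` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def make_toy_model(rng, n_classes=5, dim=64):
    """Hypothetical stand-in for a VLM: a random linear map from pixels to logits."""
    return rng.normal(size=(n_classes, dim)) / np.sqrt(dim)


def loss_and_grad(weights, image, target):
    """Cross-entropy toward `target` and its gradient with respect to the image."""
    probs = softmax(weights @ image)
    loss = -np.log(probs[target] + 1e-12)
    one_hot = np.zeros_like(probs)
    one_hot[target] = 1.0
    grad = weights.T @ (probs - one_hot)
    return loss, grad


def ensemble_loss(ensemble, image, target):
    """Mean target loss of one image across every model in the ensemble."""
    return float(np.mean([loss_and_grad(w, image, target)[0] for w in ensemble]))


def optimize_image(ensemble, target, steps=200, lr=0.5, seed=0):
    """Projected gradient descent on a single image shared by all attacked models."""
    rng = np.random.default_rng(seed)
    dim = ensemble[0].shape[1]
    image = rng.uniform(0.0, 1.0, size=dim)  # random starting image
    history = [ensemble_loss(ensemble, image, target)]
    for _ in range(steps):
        # Average the per-model gradients so one image serves the whole ensemble.
        grad = np.mean([loss_and_grad(w, image, target)[1] for w in ensemble], axis=0)
        # Step downhill, then project back onto the valid pixel range [0, 1].
        image = np.clip(image - lr * grad, 0.0, 1.0)
        history.append(ensemble_loss(ensemble, image, target))
    return image, history


rng = np.random.default_rng(0)
attacked = [make_toy_model(rng) for _ in range(3)]  # models the image is tuned on
held_out = make_toy_model(rng)                      # model never seen in optimization
image, history = optimize_image(attacked, target=0)
print("attacked-ensemble loss:", history[0], "->", history[-1])
print("held-out loss:", ensemble_loss([held_out], image, target=0))
```

Because the toy models are independent random matrices, the held-out number says nothing about real VLMs; the sketch only shows the evaluation protocol of comparing loss on attacked models with loss on a model excluded from optimization.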

Taxonomy

Topics: Digital Media Forensic Detection

Methods: Sparse Evolutionary Training · Focus