Jailbreaking Vision-Language Models Through the Visual Modality
Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman

TL;DR
This paper demonstrates four vision-based jailbreak attacks that bypass the safety alignment of vision-language models (VLMs), revealing a cross-modality alignment gap.
Contribution
It introduces four novel visual attack methods on VLMs, exposing a gap between text-based safety training and harmful intent conveyed visually.
Findings
Visual attacks bypass safety alignment in six frontier VLMs.
The visual cipher achieves a 40.9% success rate, outperforming the textual cipher.
The visual attacks show that vision must be treated as a safety target.
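For readers unfamiliar with how a figure like the success rate above is usually derived, here is a minimal sketch of the arithmetic. It assumes each attack attempt has already been labeled as successful or not by some judge; the function name `attack_success_rate` and the sample labels are hypothetical, and the paper's actual judging procedure is not described in this summary.

```python
from typing import Sequence


def attack_success_rate(judged_successful: Sequence[bool]) -> float:
    """Return the percentage of attempts that a judge labeled successful.

    Assumption: one boolean per attack attempt, True when the model's
    response was judged to comply with the harmful request.
    """
    if not judged_successful:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judged_successful) / len(judged_successful)


# Hypothetical labels for five attempts (not data from the paper).
outcomes = [True, False, False, True, False]
print(f"{attack_success_rate(outcomes):.1f}%")  # prints "40.0%"
```

Comparing two attacks, such as a visual and a textual cipher, then amounts to computing this percentage over the same set of prompts for each attack.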
Abstract
The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves 40.9%…
