Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua

TL;DR
This paper shows how safe images can be exploited to jailbreak large vision-language models (LVLMs) by leveraging the models' universal reasoning capabilities and the safety snowball effect, and introduces an agent-based framework called Safety Snowball Agent (SSA).
Contribution
The study shows that LVLMs are vulnerable to jailbreaks built from safe images and proposes SSA, an agent-based method that exploits inherent model properties to induce unsafe outputs.
Findings
SSA achieves high success rates in jailbreaking LVLMs.
Nearly any image can be used to induce unsafe content.
The approach challenges existing safety enforcement in multimodal systems.
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this great success, the safety guardrail of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework leveraging agents' autonomous and…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Law in Society and Culture · Digital Media Forensic Detection
