Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Chenhang Cui; Gelei Deng; An Zhang; Jingnan Zheng; Yicong Li; Lianli Gao; Tianwei Zhang; Tat-Seng Chua

arXiv:2411.11496·cs.CL·December 2, 2024

TL;DR

This paper shows how safe images can be exploited to jailbreak large vision-language models (LVLMs) by leveraging two properties of the models, their universal reasoning capabilities and the safety snowball effect, and introduces a novel agent-based framework called Safety Snowball Agent (SSA).

Contribution

The study shows that LVLMs are vulnerable to jailbreaks built from safe images and prompts, and proposes SSA, a new agent-based method that exploits inherent model properties to induce unsafe outputs.

Findings

01. SSA achieves high success rates in jailbreaking LVLMs.

02. Nearly any image can be used to induce unsafe content.

03. The approach challenges existing safety enforcement in multimodal systems.

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this great success, the safety guardrail of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework leveraging agents' autonomous and…

Code & Models

Repositories

gzcch/safety_snowball_agent
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Law in Society and Culture · Digital Media Forensic Detection

Methods: Focus