Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Shuyang Hao; Bryan Hooi; Jun Liu; Kai-Wei Chang; Zi Huang; Yujun Cai

arXiv:2411.18000·cs.CV·December 2, 2024

TL;DR

This paper uncovers visual vulnerabilities in vision-language models and introduces MLAI, a multi-loss adversarial framework that significantly improves jailbreak attack success rates and exposes weaknesses in current safety measures.

Contribution

The paper presents MLAI, a novel multi-loss adversarial attack framework that enhances jailbreak success rates and reveals fundamental visual vulnerabilities in VLMs.

Findings

01

Scenario-matched images amplify harmful outputs.

02

Minimal loss does not ensure attack success.

03

MLAI achieves attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
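
Finding 02 can be illustrated with a toy numerical sketch. This is not the paper's implementation: the one-dimensional `toy_loss`, the fixed `OFFSETS` grid, and the function names are all hypothetical, chosen only to show why a candidate with the lowest loss can be less robust than one sitting in a flatter region. The sketch compares selecting a candidate by its raw loss against selecting it by its average loss over a small neighborhood, which is one simple proxy for flatness.

```python
# Toy illustration only: a 1-D loss with a sharp minimum at x=0 (loss 0.0)
# and a wide, slightly higher minimum at x=3 (loss 0.1).
# All names and constants here are assumptions made for this sketch.

OFFSETS = (-0.2, -0.1, 0.0, 0.1, 0.2)  # small perturbations around a candidate


def toy_loss(x: float) -> float:
    sharp_basin = 50.0 * x ** 2               # narrow basin, minimum 0.0 at x=0
    wide_basin = 0.1 + 0.5 * (x - 3.0) ** 2   # flat basin, minimum 0.1 at x=3
    return min(sharp_basin, wide_basin)


def neighborhood_loss(x: float, offsets=OFFSETS) -> float:
    """Average loss over small perturbations of x (a simple flatness proxy)."""
    return sum(toy_loss(x + d) for d in offsets) / len(offsets)


def pick_min_loss(candidates):
    """Select the candidate with the lowest loss at the point itself."""
    return min(candidates, key=toy_loss)


def pick_flattest(candidates):
    """Select the candidate whose neighborhood has the lowest average loss."""
    return min(candidates, key=neighborhood_loss)


candidates = [0.0, 3.0]
print(pick_min_loss(candidates))   # chooses the sharp minimum
print(pick_flattest(candidates))   # chooses the flat minimum
```

In this toy setting the sharp minimum wins on raw loss, but a small perturbation raises its loss sharply, so the neighborhood average favors the flat minimum instead.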

Abstract

Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection