The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen

TL;DR
This paper explores the inherent trade-off between the effectiveness and stealthiness of jailbreak attacks on vision-language models, proposing an information-theoretic framework and detection algorithm to enhance model robustness.
Contribution
It introduces an information-theoretic framework based on Fano's inequality to analyze attack-stealthiness trade-offs and presents an efficient detection algorithm for non-stealthy jailbreaks.
Findings
Attack success probability is linked to prompt stealthiness.
Proposed detection algorithm improves robustness against jailbreaks.
Experimental results reveal the tension between attack strength and detectability.
Abstract
Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks that compromise safety and reliability. In this paper, we provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness. Drawing on Fano's inequality, we demonstrate how an attacker's success probability is intrinsically linked to the stealthiness of generated prompts. Building on this, we propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness. Experimental results highlight the tension between strong attacks and their detectability, providing insights into both adversarial strategies and defense mechanisms.
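For reference, the standard textbook form of Fano's inequality, on which the abstract says the framework draws, is given below. The notation is generic and is an assumption, not the paper's own: X is a discrete random variable on a finite alphabet \mathcal{X}, Y is an observation, \hat{X} is any estimate of X computed from Y, P_e = \Pr(\hat{X} \neq X) is the error probability, and H_b is the binary entropy function. How the paper maps attack success and prompt stealthiness onto these quantities is not stated in this summary.

```latex
% Fano's inequality (standard form, entropies in bits)
H(X \mid Y) \;\le\; H_b(P_e) + P_e \log_2\bigl(|\mathcal{X}| - 1\bigr)

% Commonly used weakened form, obtained from
% H_b(P_e) \le 1 and \log_2(|\mathcal{X}| - 1) \le \log_2 |\mathcal{X}|
P_e \;\ge\; \frac{H(X \mid Y) - 1}{\log_2 |\mathcal{X}|}
```

Read informally, the less an observation Y reveals about X (the larger H(X | Y)), the higher the unavoidable error probability of any estimator. Bounds of this kind are what typically tie a success probability to an information quantity.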
Taxonomy
Topics: Deception detection and forensic psychology · Digital and Cyber Forensics · Knowledge Management and Technology
Methods: Diffusion
