The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense
Yangyang Guo, Fangkai Jiao, Liqiang Nie, Mohan Kankanhalli

TL;DR
This paper investigates the paradox that both jailbreak attacks on Vision Large Language Models (VLLMs) and defenses against them achieve high performance. It analyzes the underlying causes, examines the limitations of current defenses, and proposes a safety-aware detection method to improve trustworthiness.
Contribution
It offers a new explanation for VLLM jailbreak vulnerability, identifies the problem of over-prudence in defenses, and introduces a simple safety-aware detection pipeline.
Findings
The inclusion of vision inputs is offered as one tentative explanation for why VLLMs are prone to jailbreak attacks.
Current defenses suffer from over-prudence, causing unintended abstention.
Jailbreak evaluation methods often agree with one another only at chance level.
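The last two findings can be made concrete with a small sketch. It is illustrative only: this page does not say which statistics the paper uses, so Cohen's kappa is assumed here as one common way to check whether two jailbreak evaluators agree beyond chance, and the abstention rate on benign inputs is assumed as a simple proxy for over-prudence. The function names `cohen_kappa` and `benign_abstention_rate` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two evaluators' labels (assumed metric, not the paper's).

    A value near 0 means the evaluators agree about as often as chance predicts;
    a value of 1 means perfect agreement.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two evaluators give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each evaluator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both evaluators always emit the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def benign_abstention_rate(refused_flags):
    """Share of benign prompts on which a defended model refused to answer."""
    if not refused_flags:
        raise ValueError("need at least one benign prompt")
    return sum(bool(flag) for flag in refused_flags) / len(refused_flags)


# Toy data: 1 = "response judged harmful", 0 = "response judged safe".
judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 1, 0]
kappa = cohen_kappa(judge_a, judge_b)

# Toy data: True = the defended model refused a benign prompt.
over_prudence = benign_abstention_rate([True, False, False, True])
```

In this toy case the two judges match on half of the items, which is exactly what their label frequencies predict by chance, so kappa comes out at zero even though raw agreement is 50%. A defense that looks strong on harmful prompts would also need a low benign abstention rate to count as useful.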
Abstract
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise. However, recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations, often with minimal effort. This dual high performance in both attack and defense raises a fundamental and perplexing paradox. To gain a deep understanding of this issue and thus further help strengthen the trustworthiness of VLLMs, this paper makes three key contributions: i) One tentative explanation for VLLMs being prone to jailbreak attacks: inclusion of vision inputs, as well as its in-depth analysis. ii) The recognition of a largely ignored problem in existing defense mechanisms: over-prudence. The problem causes these defense methods to exhibit unintended abstention, even in the presence of benign inputs, thereby undermining their…
Taxonomy
Topics: Information and Cyber Security · Safety Systems Engineering in Autonomy · Cybersecurity and Cyber Warfare Studies
