Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails
Yijun Yang, Lichao Wang, Xiao Yang, Lanqing Hong, Jun Zhu

TL;DR
This paper introduces MultiFaceted Attack, a comprehensive black-box method that effectively bypasses multi-layered safety defenses in vision large language models, exposing significant vulnerabilities.
Contribution
The paper presents a novel multi-faceted attack framework that systematically breaches the safety mechanisms of VLLMs and demonstrates high success rates against commercial models.
Findings
Achieves a 61.56% attack success rate across eight VLLMs
Surpasses state-of-the-art methods by at least 42.18%
Effectively exploits multimodal and alignment vulnerabilities
Abstract
Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented multi-layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass multi-layered defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
