TL;DR
VLMShield is a lightweight, plug-and-play safety detector for vision-language models. It flags malicious multimodal prompts using features from a new extraction framework, MAFE, and is backed by an empirical analysis of those features.
Contribution
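The paper proposes MAFE, a framework that lets CLIP handle long text and fuse multimodal information into unified representations, and VLMShield, a lightweight safety detector built on MAFE features that defends against malicious prompts efficiently and robustly.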
Findings
VLMShield outperforms existing defenses in robustness and efficiency.
MAFE-extracted features of benign and malicious prompts show distinct distributional patterns.
Code implementation is publicly available at the provided GitHub URL.
Abstract
Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from limited efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework that enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for…
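The abstract does not spell out how MAFE aggregates long text or fuses modalities, so the following is only a minimal sketch of one plausible reading: split a long prompt into windows that fit a short text context (CLIP's text encoder accepts 77 tokens), embed each window, average the window embeddings, and concatenate the result with an image embedding before passing it to a small detector head. Everything here is an assumption made for illustration: `toy_embed` is a hash-based stand-in rather than CLIP, and `DIM`, `MAX_TOKENS`, mean pooling, concatenation, and the linear `is_malicious` head are hypothetical choices, not the authors' method.

```python
import hashlib
import math

DIM = 16         # hypothetical embedding width (real CLIP embeddings are larger)
MAX_TOKENS = 8   # toy stand-in for a short text context window


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_embed(tokens):
    """Stand-in text encoder (NOT CLIP): deterministic vector from token hashes."""
    vec = [0.0] * DIM
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    return l2_normalize(vec)


def aggregate_text(text):
    """Embed a long text window by window, then mean-pool the window embeddings."""
    tokens = text.split()
    windows = [tokens[i:i + MAX_TOKENS] for i in range(0, len(tokens), MAX_TOKENS)]
    embeddings = [toy_embed(w) for w in (windows or [[]])]
    mean = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    return l2_normalize(mean)


def fuse(text, image_embedding):
    """Assumed fusion: concatenate pooled text features with image features."""
    return aggregate_text(text) + l2_normalize(list(image_embedding))


def is_malicious(feature, weights, bias=0.0):
    """Hypothetical linear detector head; weights would come from training."""
    score = sum(w * x for w, x in zip(weights, feature)) + bias
    return score > 0.0


# Usage: a prompt longer than one window still yields a fixed-size feature.
feature = fuse("describe this picture in detail " * 10, [1.0] * DIM)
print(len(feature))
```

Because the fused feature has a fixed size regardless of prompt length, a detector of this shape can run as a cheap pre-filter in front of the VLM, which is consistent with the plug-and-play framing in the abstract.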
