Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang, Fengbin Zhu, Jingkun Tang, Pan Zhou, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua

TL;DR
This paper introduces RADAR, a large-scale adversarial image dataset, and NEARSIDE, a novel detection method using a single embedding vector, to improve the safety of vision-language models against adversarial attacks.
Contribution
It presents a new large-scale adversarial dataset and a novel embedding-based detection method that is effective, efficient, and transferable across models.
Findings
NEARSIDE effectively detects adversarial images crafted against LLaVA and MiniGPT-4.
The method demonstrates high efficiency and cross-model transferability.
RADAR provides a comprehensive dataset for adversarial attack research.
Abstract
Visual Language Models (VLMs) are vulnerable to adversarial attacks, especially those from adversarial images, a threat that remains under-explored in the literature. To facilitate research on this critical safety problem, we first construct a new laRge-scale Adversarial images dataset with Diverse hArmful Responses (RADAR), given that existing datasets are either small-scale or contain only limited types of harmful responses. With the new RADAR dataset, we further develop a novel and effective iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of VLMs, which we call the attacking direction, to distinguish adversarial images from benign ones in the input. Extensive experiments with two victim VLMs, LLaVA and MiniGPT-4, well demonstrate the effectiveness, efficiency, and cross-model…
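The abstract is truncated and does not spell out how the attacking direction is obtained or applied, so the following is only a minimal sketch of the general idea of detecting with a single vector. It assumes, purely for illustration, that the direction is the difference between the mean embeddings of adversarial and benign inputs, and that detection thresholds the projection of a new embedding onto that direction. The function names, the toy 8-dimensional embeddings, and the midpoint threshold are all hypothetical and are not taken from the paper.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def attacking_direction(adv_embeddings, benign_embeddings):
    """Assumed construction: mean adversarial embedding minus mean benign embedding."""
    adv_mean = mean_vector(adv_embeddings)
    ben_mean = mean_vector(benign_embeddings)
    return [a - b for a, b in zip(adv_mean, ben_mean)]


def projection(embedding, direction):
    """Scalar projection of an embedding onto the (normalized) direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(e * d for e, d in zip(embedding, direction)) / norm


def fit_threshold(adv_embeddings, benign_embeddings, direction):
    """Assumed rule: midpoint between the two classes' average projections."""
    adv_avg = sum(projection(e, direction) for e in adv_embeddings) / len(adv_embeddings)
    ben_avg = sum(projection(e, direction) for e in benign_embeddings) / len(benign_embeddings)
    return (adv_avg + ben_avg) / 2


def is_adversarial(embedding, direction, threshold):
    """Flag an input whose projection onto the direction exceeds the threshold."""
    return projection(embedding, direction) > threshold


# Toy stand-ins for VLM hidden states: benign embeddings near the origin,
# adversarial ones shifted by a constant offset (synthetic data, not RADAR).
random.seed(0)
DIM = 8
benign = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(50)]
adversarial = [[3.0 + random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(50)]

direction = attacking_direction(adversarial, benign)
threshold = fit_threshold(adversarial, benign, direction)

print(is_adversarial([3.0] * DIM, direction, threshold))
print(is_adversarial([0.0] * DIM, direction, threshold))
```

Under these assumptions, detection costs one dot product per input on top of the forward pass that already produces the hidden state, which is one plausible reading of the efficiency claim in the abstract.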
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications
