A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models
Haonan Zheng, Xinyang Deng, Wen Jiang, Wenrui Li

TL;DR
This paper introduces two novel attack methods, FGA and its extension FGA-T, which use text representations to guide the generation of adversarial images, revealing vulnerabilities in vision-language pre-training (VLP) models across multiple scenarios.
Contribution
The paper presents the Feature Guidance Attack (FGA) and its extension FGA-T, enabling effective adversarial attacks on VLP models by integrating text-guided perturbations, bridging unimodal and multimodal attack strategies.
Findings
FGA-T achieves superior attack effects on VLP models.
Data augmentation and momentum improve transferability.
Attack performance remains stable across datasets and tasks.
Abstract
With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain challenging. We note that in CV models, the understanding of images comes from annotated information, while VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we developed the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of…
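The idea of using a text representation to direct the perturbation of a clean image can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the authors' FGA: the image encoder is a hypothetical random linear map `W`, the "text" embedding `t` is synthetic, the objective (lowering cosine similarity between image and text features) is one plausible reading of feature guidance, and the budget `eps`, step size `alpha`, and step count are conventional placeholder values.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def cosine_grad_wrt_image(W, x, t):
    """Analytic gradient of cos(W @ x, t) with respect to x (toy linear encoder)."""
    f = W @ x
    nf, nt = np.linalg.norm(f), np.linalg.norm(t)
    grad_f = t / (nf * nt) - (f @ t) * f / (nf**3 * nt)
    return W.T @ grad_f

def feature_guided_pgd(W, x, t, eps=8 / 255, alpha=2 / 255, steps=10):
    """Projected sign-gradient descent on image-text similarity (assumed objective)."""
    x_adv = x.copy()
    for _ in range(steps):
        g = cosine_grad_wrt_image(W, x_adv, t)
        x_adv = x_adv - alpha * np.sign(g)         # push features away from the text
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay inside the L_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                  # hypothetical image encoder
x = rng.uniform(0.2, 0.8, size=64)             # "clean image" as a flat vector
t = W @ x + rng.normal(scale=4.0, size=16)     # synthetic matching text embedding
x_adv = feature_guided_pgd(W, x, t)
sim_clean = cosine(W @ x, t)
sim_adv = cosine(W @ x_adv, t)
```

Comparing `sim_clean` with `sim_adv` shows how far the bounded perturbation moves the image feature away from the text feature. A real VLP model would replace `W` with its image encoder and the analytic gradient with automatic differentiation.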
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Feature Guidance Attack (FGA)
