One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Hao Wu, Shutao Xia, Ke Xu

TL;DR
This paper introduces a universal adversarial perturbation (UAP) method for vision-language pre-training (VLP) models: a single perturbation attacks any input without per-sample customization, disrupting the models' multimodal alignment and degrading their performance.
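As a rough illustration of what "one perturbation for any input" means, the sketch below adds a single shared perturbation to every image in a batch. The function name `apply_universal_perturbation`, the L-infinity budget `eps=8/255`, and the image layout are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def apply_universal_perturbation(images, delta, eps=8 / 255):
    """Add one shared perturbation to every image in a batch.

    images: array of shape (N, H, W, C) with values in [0, 1].
    delta:  array of shape (H, W, C), broadcast over the batch.
    eps:    assumed L-infinity budget (hypothetical default).
    """
    # Project the perturbation onto the L-infinity ball of radius eps.
    delta = np.clip(delta, -eps, eps)
    # Same delta for all inputs; keep pixels in the valid range.
    return np.clip(images + delta, 0.0, 1.0)


rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))               # toy batch of "images"
delta = rng.normal(scale=0.1, size=(8, 8, 3))   # stand-in for a learned UAP
adv = apply_universal_perturbation(images, delta)
```

An instance-specific attack would instead optimize a separate `delta` for each image; here the same array is reused for the whole batch.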
Contribution
The paper proposes C-PGC, a contrastive-training based universal adversarial perturbation generator that effectively disrupts vision-language pre-training models using cross-modal and unimodal guidance.
Findings
C-PGC successfully attacks various VLP models across multiple tasks.
The universal perturbation significantly degrades model alignment and accuracy.
The method outperforms previous instance-specific adversarial approaches.
Abstract
Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of the multimodal alignment. However, previous studies have shown they are vulnerable to maliciously crafted adversarial samples. Despite recent success, these methods are generally instance-specific and require generating perturbations for each input sample. In this paper, we reveal that VLP models are also vulnerable to the instance-agnostic universal adversarial perturbation (UAP). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) to achieve the attack. Given that the pivotal multimodal alignment is achieved through the advanced contrastive learning technique, we turn this powerful weapon against the models themselves, i.e., we employ a malicious version of contrastive learning to train the C-PGC…
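The idea of a "malicious version of contrastive learning" can be illustrated with a small sketch. The abstract is truncated and does not give the actual C-PGC loss, so the code below is not the paper's objective: it simply negates a standard InfoNCE alignment loss, which is one simple way to reward misalignment between adversarial image embeddings and their matched text embeddings. The names `info_nce` and `malicious_objective` and the temperature value are assumptions for this example.

```python
import numpy as np


def l2_normalize(x):
    # Unit-normalize each row so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Standard image-to-text InfoNCE: row i of img_emb matches row i of txt_emb."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def malicious_objective(adv_img_emb, txt_emb, temperature=0.07):
    # Assumed inversion: minimizing this value maximizes the alignment loss,
    # i.e. it pushes adversarial images away from their matched texts.
    return -info_nce(adv_img_emb, txt_emb, temperature)


rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 16))
clean_img = txt + 0.01 * rng.normal(size=(4, 16))  # well-aligned toy pairs
adv_img = np.roll(txt, 1, axis=0)                   # each image matches the wrong text
```

In this toy setup the misaligned embeddings give a much larger alignment loss than the clean ones, so the negated objective prefers them. A perturbation generator trained against a real VLP encoder would obtain `adv_img` by encoding perturbed images rather than by shuffling rows.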
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper focuses on an important task of evaluating robustness of VLP models. 2. Both adversarial images and texts are learned.
Several concerns remain: 1. Motivation: - In the abstract, the authors claim to "fully utilize the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information." However, the authors do not seem to fully exploit the characteristics of different V+L scenarios or tasks. - In the introduction, Figure 1 compares two methods and claims that "the generator-based approach GAP consistently achieves superior ASR compared to UAP." Since UAP uses the De
1. The proposed UAP framework addresses the inefficiencies of instance-specific attacks by incorporating cross-modal and unimodal guidance within a contrastive training setup, representing an advancement in universal adversarial attack methods. 2. The paper thoroughly evaluates C-PGC's effectiveness across multiple VLP models and downstream tasks, and additionally analyzes various defense strategies to mitigate the potential risks posed by C-PGC.
The proposed method leverages image and text attacks alongside cross-modal contrastive learning to generate universal adversarial perturbations. While this approach shows promise, the novelty may not be fully evident. I recommend that the authors consider further highlighting and reorganizing the unique contributions of the paper to enhance its clarity and impact.
The writing style of the paper is commendably clear and concise, making it accessible to a broad audience within the machine learning and computer vision communities. The authors have taken care to present the technical details in a manner that is straightforward and easy to follow, even for readers who may not be deeply familiar with adversarial attacks or VLMs. The method’s components are explained in a way that balances technical rigor with simplicity. This makes the paper highly readable and
The paper, while strong overall, has several areas for improvement: 1. **Use of Contrastive Loss** The inclusion of contrastive loss (\(\mathcal{L}_{CL}\)) feels somewhat forced. Since the goal is to perform untargeted attacks, it seems unnecessary to rely on contrastive loss, which is typically used to enforce alignment between representations. While the authors have shown its utility through ablation studies, the logical foundation of using contrastive loss in an untargeted setting remai
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Topic Modeling
Methods: Focus · Contrastive Learning
