Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Wanqi Zhou; Shuanghao Bai; Danilo P. Mandic; Qibin Zhao; Badong Chen

arXiv:2404.19287 · cs.CV · November 13, 2024

TL;DR

This paper introduces multimodal contrastive adversarial training (MMCoA) to improve the robustness of vision-language models such as CLIP against image, text, and multimodal adversarial attacks. Defenses against text-based and multimodal attacks have been largely unexplored.

Contribution

The paper presents what the authors describe as the first comprehensive study on improving the adversarial robustness of VLMs against image, text, and multimodal attacks, and proposes MMCoA to strengthen both the image and text encoders.

Findings

01 MMCoA improves robustness against image, text, and multimodal attacks

02 Extensive experiments on 15 datasets validate effectiveness

03 Unified framework for multimodal adversarial defense

Abstract

Pretrained vision-language models (VLMs) like CLIP exhibit exceptional generalization across diverse downstream tasks. While recent studies reveal their vulnerability to adversarial attacks, research to date has primarily focused on enhancing the robustness of image encoders against image-based attacks, with defenses against text-based and multimodal attacks remaining largely unexplored. To this end, this work presents the first comprehensive study on improving the adversarial robustness of VLMs against attacks targeting image, text, and multimodal inputs. This is achieved by proposing multimodal contrastive adversarial training (MMCoA). Such an approach strengthens the robustness of both image and text encoders by aligning the clean text embeddings with adversarial image embeddings, and adversarial text embeddings with clean image embeddings. The robustness of the proposed MMCoA is…
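The cross-modal alignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a CLIP-style symmetric contrastive (InfoNCE) loss, equal weighting of the two alignment terms, and a temperature of 0.07, none of which are stated in the abstract. The function names are hypothetical, and the adversarial embeddings are taken as given inputs (the attack that produces them is not shown).

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss (assumed form)."""
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

def mmcoa_style_loss(clean_img, adv_img, clean_txt, adv_txt, temperature=0.07):
    """Align adversarial images with clean text, and clean images with adversarial text.

    Equal weighting of the two terms is an assumption of this sketch.
    """
    image_attack_term = contrastive_loss(adv_img, clean_txt, temperature)
    text_attack_term = contrastive_loss(clean_img, adv_txt, temperature)
    return image_attack_term + text_attack_term

if __name__ == "__main__":
    # Toy 2-D embeddings for a batch of two image-text pairs.
    clean_img = [[1.0, 0.0], [0.0, 1.0]]
    clean_txt = [[1.0, 0.0], [0.0, 1.0]]
    adv_img = [[0.9, 0.1], [0.1, 0.9]]   # stand-ins for perturbed image embeddings
    adv_txt = [[0.8, 0.2], [0.2, 0.8]]   # stand-ins for perturbed text embeddings
    print(mmcoa_style_loss(clean_img, adv_img, clean_txt, adv_txt))
```

In this sketch the loss is small when each adversarial embedding stays closest to its clean counterpart in the other modality, and grows when the perturbation pushes it toward a mismatched pair, which is the behavior adversarial training would penalize.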

Code & Models

Repositories

ellezwq/mmcoa (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Language, Metaphor, and Cognition

Methods: Contrastive Language-Image Pre-training