Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships

Futa Waseda; Antonio Tejero-de-Pablos; Isao Echizen

arXiv:2405.18770 · cs.CV · January 6, 2026

TL;DR

This paper introduces a novel multimodal adversarial training method to defend vision-language models against attacks on both image and text modalities, leveraging one-to-many relationships to improve robustness.

Contribution

This is the first work to explore defense strategies against multimodal attacks on vision-language models. It also examines diverse, well-aligned data augmentation that leverages one-to-many image-text relationships.

Findings

01. Multimodal adversarial training significantly outperforms unimodal defenses.

02. Leveraging one-to-many relationships enhances robustness with well-aligned, diverse data.

03. Proper augmentation avoids distribution shift, improving defense effectiveness.
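
To make the one-to-many idea in the findings concrete, the sketch below groups a small list of image-text pairs so that one image maps to several captions (1:N) and one caption maps to several images (N:1). The file names, captions, and the helper `group_one_to_many` are hypothetical and serve only as illustration; this is not the paper's augmentation procedure.

```python
from collections import defaultdict

# Hypothetical image-text pairs; the identifiers and captions are made up.
pairs = [
    ("img_001.jpg", "a dog runs on the beach"),
    ("img_001.jpg", "a puppy playing near the sea"),
    ("img_002.jpg", "a dog runs on the beach"),
    ("img_003.jpg", "a red bus on a street"),
]

def group_one_to_many(pairs):
    """Index pairs in both directions: image -> captions and caption -> images."""
    captions_per_image = defaultdict(list)
    images_per_caption = defaultdict(list)
    for image_id, caption in pairs:
        captions_per_image[image_id].append(caption)
        images_per_caption[caption].append(image_id)
    return dict(captions_per_image), dict(images_per_caption)

captions_per_image, images_per_caption = group_one_to_many(pairs)
```

Here `captions_per_image["img_001.jpg"]` holds two captions (a 1:N case) and `images_per_caption["a dog runs on the beach"]` holds two images (an N:1 case), whereas a strictly 1:1 dataset would have exactly one entry per key.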

Abstract

Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To…
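
The idea of perturbing both modalities during training can be illustrated with a toy example. The sketch below is not the paper's MAT implementation: the linear image encoder `W`, the mean-of-token text embedding `E`, the substitution table, the single-step sign attack, and the loss (negative similarity of the matched pair) are all assumptions chosen so the example stays self-contained. It perturbs the text by one token substitution, perturbs the image within an L-infinity budget `eps`, and then takes one gradient step on the adversarial pair.

```python
# Toy sketch only: every model, attack, and value here is an illustrative assumption.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def text_embed(tokens, E):
    """Mean of the token embeddings."""
    dim = len(E[tokens[0]])
    return [sum(E[t][k] for t in tokens) / len(tokens) for k in range(dim)]

def similarity(W, x, tokens, E):
    """Image-text similarity s = (W x) . t."""
    return dot(matvec(W, x), text_embed(tokens, E))

def attack_text(W, x, tokens, E, substitutes):
    """Greedy attack: apply the single token substitution that lowers s the most."""
    best, best_s = list(tokens), similarity(W, x, tokens, E)
    for i, tok in enumerate(tokens):
        for cand in substitutes.get(tok, []):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            s = similarity(W, x, trial, E)
            if s < best_s:
                best, best_s = trial, s
    return best

def attack_image(W, x, tokens, E, eps):
    """One signed-gradient step that lowers s, bounded by eps per pixel."""
    t = text_embed(tokens, E)
    # s is linear in x, so ds/dx_j = sum_i W[i][j] * t[i].
    grad = [sum(W[i][j] * t[i] for i in range(len(W))) for j in range(len(x))]
    sign = lambda g: (g > 0) - (g < 0)
    return [xj - eps * sign(g) for xj, g in zip(x, grad)]

def train_step(W, x_adv, tokens_adv, E, lr):
    """Gradient step on loss = -s for the adversarial pair: dL/dW[i][j] = -t[i] * x[j]."""
    t = text_embed(tokens_adv, E)
    return [[W[i][j] + lr * t[i] * x_adv[j] for j in range(len(x_adv))]
            for i in range(len(W))]

# Hypothetical toy data: a 3-"pixel" image and a 2-dimensional embedding space.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
E = {"a": [1.0, 0.0], "dog": [0.5, 0.5], "puppy": [0.4, 0.6],
     "cat": [-0.5, 0.2], "runs": [0.2, 0.3], "sits": [0.1, 0.1]}
substitutes = {"dog": ["puppy", "cat"], "runs": ["sits"]}
x, tokens, eps, lr = [1.0, 0.5, -0.5], ["a", "dog", "runs"], 0.05, 0.1

s_clean = similarity(W, x, tokens, E)
tokens_adv = attack_text(W, x, tokens, E, substitutes)    # perturb the text first
x_adv = attack_image(W, x, tokens_adv, E, eps)            # then perturb the image
s_adv = similarity(W, x_adv, tokens_adv, E)
W_new = train_step(W, x_adv, tokens_adv, E, lr)           # train on the adversarial pair
s_after = similarity(W_new, x_adv, tokens_adv, E)
```

In this toy setting the joint perturbation lowers the matched-pair similarity (`s_adv < s_clean`), and the training step raises it again on the adversarial pair (`s_after > s_adv`). A unimodal defense would correspond to skipping one of the two attack calls before the training step.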

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Focus