Coordinated Robustness Evaluation Framework for Vision-Language Models

Ashwin Ramesh Babu; Sajad Mousavi; Vineet Gundecha; Sahand Ghorbanpour; Avisek Naug; Antonio Guillen; Ricardo Luna Gutierrez; Soumyendu Sarkar

arXiv:2506.05429 · cs.CV · June 9, 2025

TL;DR

This paper introduces a coordinated adversarial attack framework that perturbs the image and text inputs of vision-language models simultaneously in order to expose their robustness weaknesses.

Contribution

The work presents a joint perturbation method for vision-language models that outperforms existing multi-modal attacks and highlights the robustness challenges these models face in multi-modal tasks.

Findings

01. Outperforms other multi-modal attack strategies

02. Effectively compromises state-of-the-art vision-language models

03. Reveals robustness vulnerabilities in pre-trained models

Abstract

Vision-language models, which integrate computer vision and natural language processing capabilities, have demonstrated significant advancements in tasks such as image captioning and visual question answering. However, like traditional models, they are susceptible to small perturbations, which poses a challenge to their robustness, particularly in deployment scenarios. Evaluating the robustness of these models requires perturbations in both the vision and language modalities to learn their inter-modal dependencies. In this work, we train a generic surrogate model that takes both image and text as input and generates a joint representation, which is then used to generate adversarial perturbations for both the text and image modalities. This coordinated attack strategy is evaluated on the visual question answering and visual reasoning datasets using various state-of-the-art…
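
The coordination idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the surrogate here is a hypothetical toy (fixed random projections instead of a trained model), the text attack is a greedy synonym substitution, and the image attack is a single finite-difference sign step. All names (`joint_embed`, `attack_text`, `attack_image`, `coordinated_attack`, `SYNONYMS`) and all sizes are assumptions made for illustration. The only point it tries to show is that both perturbations are chosen against the same joint representation.

```python
import math
import random

DIM, N_PIXELS = 4, 6  # toy sizes, chosen only for illustration

def random_vector(n, seed):
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Hypothetical stand-ins for trained encoders: fixed random projections.
W_IMG = [random_vector(N_PIXELS, seed=100 + d) for d in range(DIM)]
VOCAB = ["what", "color", "colour", "hue", "is", "the", "car", "automobile"]
W_TXT = {word: random_vector(DIM, seed=i) for i, word in enumerate(VOCAB)}
SYNONYMS = {"color": ["colour", "hue"], "car": ["automobile"]}  # assumed table

def joint_embed(pixels, tokens):
    """Toy surrogate: fuse image and text features into one joint vector."""
    img = [sum(w * p for w, p in zip(row, pixels)) for row in W_IMG]
    txt = [sum(W_TXT[t][d] for t in tokens) / len(tokens) for d in range(DIM)]
    return [math.tanh(i + t) for i, t in zip(img, txt)]

def shift(pixels, tokens, clean):
    """Euclidean distance between the current and the clean joint embedding."""
    return math.dist(joint_embed(pixels, tokens), clean)

def attack_text(pixels, tokens, clean):
    """Greedy synonym substitution that increases the embedding shift."""
    best = list(tokens)
    for pos, word in enumerate(tokens):
        for candidate in SYNONYMS.get(word, []):
            trial = best[:pos] + [candidate] + best[pos + 1:]
            if shift(pixels, trial, clean) > shift(pixels, best, clean):
                best = trial
    return best

def attack_image(pixels, tokens, clean, eps, h=1e-5):
    """One sign-gradient step (central finite differences), clipped to [0, 1]."""
    stepped = []
    for k, p in enumerate(pixels):
        up = pixels[:k] + [p + h] + pixels[k + 1:]
        down = pixels[:k] + [p - h] + pixels[k + 1:]
        grad = shift(up, tokens, clean) - shift(down, tokens, clean)
        direction = (grad > 0) - (grad < 0)
        stepped.append(min(1.0, max(0.0, p + eps * direction)))
    # Safeguard: keep the step only if it does not reduce the shift.
    if shift(stepped, tokens, clean) >= shift(pixels, tokens, clean):
        return stepped
    return list(pixels)

def coordinated_attack(pixels, tokens, eps=0.05):
    clean = joint_embed(pixels, tokens)
    adv_tokens = attack_text(pixels, tokens, clean)
    # The image step is computed against the already perturbed text,
    # so both modalities push the same joint representation.
    adv_pixels = attack_image(pixels, adv_tokens, clean, eps)
    return adv_pixels, adv_tokens, clean

pixels = [0.2, 0.8, 0.5, 0.1, 0.9, 0.4]
tokens = ["what", "color", "is", "the", "car"]
adv_pixels, adv_tokens, clean = coordinated_attack(pixels, tokens)
text_only = shift(pixels, adv_tokens, clean)
joint = shift(adv_pixels, adv_tokens, clean)
print(adv_tokens, round(text_only, 4), round(joint, 4))
```

In this toy setup the text is perturbed first and the image step is then taken with the perturbed text in place, which is one simple way to make the two perturbations depend on each other. A real evaluation would replace `joint_embed` with a trained surrogate and would measure the effect on the target model's answers rather than on an embedding distance.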

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Topic Modeling