On the robustness of multimodal language model towards distractions

Ming Liu; Hao Chen; Jindong Wang; Wensheng Zhang

arXiv:2502.09818 · cs.CV · February 17, 2025

TL;DR

This paper evaluates the robustness of vision-language models against visual and textual distractions in science question answering, revealing vulnerabilities and exploring mitigation strategies to enhance their real-world applicability.

Contribution

It introduces a new benchmark based on ScienceQA with distractions to assess VLM robustness and analyzes the effectiveness of mitigation strategies.

Findings

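1. Most VLMs, including state-of-the-art models such as GPT-4, are vulnerable to distractions and show noticeable degradation in reasoning performance.

2. Some models, such as InternVL2, are more robust to distractions than others.

3. Textual distractions degrade model performance more than visual distractions.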

Abstract

Although vision-language models (VLMs) have achieved significant success in various applications such as visual question answering, their resilience to prompt variations remains under-explored. Understanding how distractions affect VLMs is crucial for improving their real-world applicability, as inputs in many practical scenarios can contain noisy and irrelevant information. This paper assesses the robustness of VLMs against both visual and textual distractions in the context of science question answering. Building on the ScienceQA dataset, we develop a new benchmark that introduces distractions in both the visual and textual contexts to evaluate the reasoning capacity of VLMs amid these distractions. Our findings reveal that most state-of-the-art VLMs, including GPT-4, are vulnerable to various types of distractions, experiencing noticeable degradation in reasoning capabilities…
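As a rough illustration of what a textual distraction in a ScienceQA-style question could look like, the sketch below appends an irrelevant sentence to a question's context and measures the resulting accuracy drop. It is not the paper's benchmark construction: the `QAItem` structure, the prompt layout, and the helper names `add_textual_distraction`, `build_prompt`, and `accuracy_drop` are all assumptions made for this example, and visual distractions are not covered.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QAItem:
    """Hypothetical multiple-choice item; not the ScienceQA schema."""
    question: str
    choices: list[str]
    context: str = ""


def add_textual_distraction(item: QAItem, distraction: str) -> QAItem:
    """Return a copy of the item with an irrelevant sentence appended to its context."""
    new_context = f"{item.context} {distraction}".strip()
    return QAItem(item.question, list(item.choices), new_context)


def build_prompt(item: QAItem) -> str:
    """Format the item as a plain-text prompt (assumed layout, for illustration only)."""
    options = "\n".join(
        f"({chr(ord('A') + i)}) {choice}" for i, choice in enumerate(item.choices)
    )
    parts = []
    if item.context:
        parts.append(f"Context: {item.context}")
    parts.append(f"Question: {item.question}")
    parts.append(options)
    parts.append("Answer:")
    return "\n".join(parts)


def accuracy_drop(clean_correct: list[bool], distracted_correct: list[bool]) -> float:
    """Accuracy on clean prompts minus accuracy on distracted prompts."""
    if not clean_correct or len(clean_correct) != len(distracted_correct):
        raise ValueError("expected two non-empty result lists of equal length")
    clean_acc = sum(clean_correct) / len(clean_correct)
    distracted_acc = sum(distracted_correct) / len(distracted_correct)
    return clean_acc - distracted_acc
```

In use, each question would be sent to a model twice, once as `build_prompt(item)` and once as `build_prompt(add_textual_distraction(item, ...))`, and the per-question correctness flags would be passed to `accuracy_drop`. A positive value means the distraction reduced accuracy.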

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques