TL;DR
MVI-Bench is a benchmark for evaluating how robust Large Vision-Language Models (LVLMs) are to misleading visual inputs, a failure mode that existing robustness benchmarks, which focus on hallucination and misleading textual inputs, largely overlook.
Contribution
We introduce MVI-Bench, the first benchmark focusing on misleading visual inputs in LVLMs, along with a novel sensitivity metric for detailed robustness assessment.
Findings
State-of-the-art LVLMs show significant vulnerabilities to misleading visual inputs.
MVI-Bench uncovers specific weaknesses at different hierarchical levels of misleading visual cues (Visual Concept, Visual Attribute, and Visual Relationship).
The benchmark and code facilitate future development of more robust LVLMs.
Abstract
Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate…
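As a rough illustration of how results might be broken down by the three levels named in the abstract, the sketch below groups VQA instances by level and computes per-level accuracy. Only the level names come from the abstract. The `VQAInstance` fields, the `accuracy_by_level` helper, and the case-insensitive exact-match scoring are assumptions made for this example; it is not the benchmark's released code and does not implement its sensitivity metric.

```python
from collections import defaultdict
from dataclasses import dataclass

# Level names are taken from the abstract; everything else here is hypothetical.
LEVELS = ("Visual Concept", "Visual Attribute", "Visual Relationship")


@dataclass(frozen=True)
class VQAInstance:
    level: str     # one of LEVELS
    question: str
    answer: str    # ground-truth answer


def accuracy_by_level(instances, predictions):
    """Return {level: fraction answered correctly} for levels that have instances."""
    if len(instances) != len(predictions):
        raise ValueError("instances and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        if inst.level not in LEVELS:
            raise ValueError(f"unknown level: {inst.level!r}")
        total[inst.level] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if pred.strip().lower() == inst.answer.strip().lower():
            correct[inst.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}
```

Reporting accuracy per level rather than as a single aggregate makes it easier to see which kind of misleading cue a model struggles with.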
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
