MM-R$^3$: On (In-)Consistency of Vision-Language Models (VLMs)
Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal

TL;DR
This paper introduces the MM-R3 benchmark to evaluate the consistency of vision-language models across three tasks, shows that higher accuracy does not always imply higher consistency, and proposes an adapter-based mitigation strategy that improves consistency by up to 12.5%.
Contribution
The paper presents the MM-R3 benchmark for assessing VLM consistency and introduces a simple adapter method that enhances consistency without sacrificing accuracy.
Findings
Consistency and accuracy are not always correlated in VLMs: models with higher accuracy are not necessarily more consistent.
The proposed adapter improves model consistency by up to 12.5%.
Abstract
With the advent of LLMs and variants, a flurry of research has emerged, analyzing the performance of such models across an array of tasks. While most studies focus on evaluating the capabilities of state-of-the-art (SoTA) Vision Language Models (VLMs) through task accuracy (e.g., visual question answering, grounding), our work explores the related but complementary aspect of consistency: the ability of a VLM to produce semantically similar or identical responses to semantically similar queries. We note that consistency is a fundamental prerequisite (a necessary but not sufficient condition) for robustness and trust in VLMs. Armed with this perspective, we propose the MM-R3 benchmark, which allows us to analyze performance, in terms of consistency and accuracy, of SoTA VLMs on three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Our analysis reveals that consistency…
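To make the distinction between accuracy and consistency concrete, here is a minimal Python sketch. It is not the paper's metric: it assumes exact string match after light normalization as a crude stand-in for semantic similarity, and the function names and toy data (`model_a`, `model_b`) are hypothetical. Each group pairs one reference answer with a model's responses to several rephrasings of the same question.

```python
from itertools import combinations


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def accuracy(groups):
    """Fraction of all responses that match their group's reference answer."""
    total = correct = 0
    for reference, responses in groups:
        for response in responses:
            total += 1
            correct += normalize(response) == normalize(reference)
    return correct / total


def consistency(groups):
    """Fraction of response pairs within a group that agree with each other.

    Assumption: agreement is exact match after normalization, a simple proxy
    for the semantic similarity described in the abstract.
    """
    total = agree = 0
    for _, responses in groups:
        for a, b in combinations(responses, 2):
            total += 1
            agree += normalize(a) == normalize(b)
    return agree / total


# Hypothetical toy data: (reference answer, responses to three rephrasings).
model_a = [("red", ["red", "red", "blue"]), ("two", ["two", "three", "two"])]
model_b = [("red", ["blue", "blue", "blue"]), ("two", ["two", "two", "two"])]

for name, groups in [("A", model_a), ("B", model_b)]:
    print(name, round(accuracy(groups), 3), round(consistency(groups), 3))
```

In this toy setup, model A answers correctly more often but changes its answer across rephrasings, while model B is wrong on one question yet gives the same answer every time. That is the kind of gap between accuracy and consistency the findings describe.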
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Adapter · ALIGN · Focus
