Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Letitia Parcalabescu, Anette Frank

TL;DR
This paper investigates how vision and language model (VLM) decoders use their image and text inputs when generating answers and explanations. It finds that text contributes more than images, that most tested VLMs are less self-consistent than LLMs, and that current decoders still struggle with many phenomena tested by VALSE.
Contribution
It introduces an analysis of modality reliance and self-consistency in VLM decoders, extending unimodal tests to multimodal models and benchmarking them on the VALSE dataset.
Findings
Text contributions are more important than image contributions in all tested VLM decoders and across all examined tasks.
Most tested VLM decoders are less self-consistent than large language models.
Current VLM decoders struggle with many phenomena tested by VALSE.
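The first finding compares how much each modality contributes to a prediction. As a rough illustration only, the sketch below turns per-token contribution scores into a text share and an image share. It assumes such scores are already available from some attribution method; the function name `modality_shares`, the use of absolute values, and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def modality_shares(
    text_scores: Sequence[float], image_scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (text_share, image_share) from per-token contribution scores.

    Assumption: the magnitude of a score reflects how much a token mattered,
    regardless of sign, so absolute values are summed per modality.
    """
    text_mass = sum(abs(s) for s in text_scores)
    image_mass = sum(abs(s) for s in image_scores)
    total = text_mass + image_mass
    if total == 0:
        raise ValueError("all contribution scores are zero; shares are undefined")
    return text_mass / total, image_mass / total


# Made-up scores for three text tokens and four image patches (illustration only).
text_share, image_share = modality_shares([0.4, -0.2, 0.3], [0.05, -0.05, 0.1, 0.1])
print(f"text share: {text_share:.2f}, image share: {image_share:.2f}")
```

Under this normalization the two shares sum to 1, so a text share above 0.5 would indicate that the text tokens carried more of the total contribution than the image patches for that example.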
Abstract
Vision and language model (VLM) decoders are currently the best-performing architectures on multimodal tasks. Next to answers, they are able to produce natural language explanations, either in post-hoc or CoT settings. However, it is not clear to what extent they are using the input vision and text modalities when generating answers or explanations. In this work, we investigate if VLMs rely on their input modalities differently when they produce explanations as opposed to answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing unimodal tests and measures to VLM decoders. We find that most tested VLMs are less self-consistent than LLMs. Text contributions in all tested VL decoders are more important than image contributions in all examined tasks. However, when comparing explanation generation to answer…
Taxonomy
Topics: Authorship Attribution and Profiling
