Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?

Letitia Parcalabescu; Anette Frank

arXiv:2404.18624·cs.CL·May 5, 2025

TL;DR

This paper investigates how vision and language model (VLM) decoders use their image and text inputs when generating answers and explanations. It finds that text contributes more than images, that most tested VLMs are less self-consistent than LLMs, and that current decoders still struggle with many phenomena tested by VALSE.

Contribution

The paper analyzes modality reliance and self-consistency in VLM decoders by extending existing unimodal tests and measures to multimodal models, and it benchmarks these decoders on the VALSE dataset.

Findings

01

Text contributions outweigh image contributions in all tested VLM decoders, across all examined tasks.

02

Most tested VLM decoders are less self-consistent than large language models.

03

Current VLM decoders struggle with many phenomena tested by VALSE.
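
Finding 01 compares how much text and image inputs contribute to a model's output. One common way to express such a comparison, assuming per-token attribution scores (for example Shapley values) are already available, is to take each modality's share of the total absolute attribution. The sketch below illustrates that idea only; the function name `modality_shares` and the example scores are hypothetical, and this is not taken from the paper's repository.

```python
from typing import Sequence, Tuple


def modality_shares(
    text_scores: Sequence[float],
    image_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (text_share, image_share) of the total absolute attribution.

    Assumption: each score is a signed per-token attribution value;
    the magnitude is treated as the token's contribution, regardless of sign.
    """
    text_mass = sum(abs(s) for s in text_scores)
    image_mass = sum(abs(s) for s in image_scores)
    total = text_mass + image_mass
    if total == 0:
        raise ValueError("all attribution scores are zero; shares are undefined")
    return text_mass / total, image_mass / total


# Made-up scores for three text tokens and two image patches.
text_share, image_share = modality_shares([0.4, -0.2, 0.1], [0.1, -0.2])
print(f"text share: {text_share:.2f}, image share: {image_share:.2f}")
```

The two shares sum to one, so a text share above 0.5 would indicate that text tokens carry more of the attribution mass than image tokens for that sample.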

Abstract

Vision and language model (VLM) decoders are currently the best-performing architectures on multimodal tasks. Next to answers, they are able to produce natural language explanations, either in post-hoc or CoT settings. However, it is not clear to what extent they are using the input vision and text modalities when generating answers or explanations. In this work, we investigate if VLMs rely on their input modalities differently when they produce explanations as opposed to answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing unimodal tests and measures to VLM decoders. We find that most tested VLMs are less self-consistent than LLMs. Text contributions in all tested VL decoders are more important than image contributions in all examined tasks. However, when comparing explanation generation to answer…

Code & Models

Repositories

heidelberg-nlp/cc-shap-vlm
Official

Taxonomy

Topics: Authorship Attribution and Profiling