How Do Vision-Language Models Process Conflicting Information Across Modalities?

Tianze Hua; Tian Yun; Ellie Pavlick

arXiv:2507.01790·cs.CL·July 3, 2025

Open Access

TL;DR

This study investigates how vision-language models handle conflicting multimodal inputs, revealing their tendency to favor one modality, the internal mechanisms behind this, and methods to control their responses for better consistency.

Contribution

The paper characterizes how vision-language models process conflicting information across modalities, shows that the preferred modality is reflected in the models' internal representations, and identifies router heads that can be manipulated to control which modality the response reports.

Findings

01. Models often favor one modality over the other.

02. Internal representations reflect modality preference.

03. Router heads can be manipulated to control responses.

Abstract

AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model,…
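The conflicting-input setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `ConflictExample` type, the `build_conflicts` and `accuracy_by_modality` helpers, and the `image_biased_stub` stand-in for a model are all hypothetical names introduced here, and a real evaluation would pass actual images to an actual vision-language model rather than class labels to a stub.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Dict, List

# Question wording follows the abstract; everything else is an assumption.
QUESTIONS: Dict[str, str] = {
    "image": "What is in the image?",
    "caption": "What does the caption say?",
}


@dataclass(frozen=True)
class ConflictExample:
    image_label: str      # class depicted in the image (stands in for the image itself)
    caption_label: str    # class named in the caption, always different from image_label
    caption: str          # caption text shown to the model
    target_modality: str  # "image" or "caption": the modality the question asks about
    question: str


def build_conflicts(labels: List[str]) -> List[ConflictExample]:
    """Pair every image class with every *other* class as the caption,
    and ask about each modality in turn."""
    examples = []
    for image_label, caption_label in permutations(labels, 2):
        for modality, question in QUESTIONS.items():
            examples.append(
                ConflictExample(
                    image_label=image_label,
                    caption_label=caption_label,
                    caption=f"A photo of a {caption_label}",
                    target_modality=modality,
                    question=question,
                )
            )
    return examples


def accuracy_by_modality(
    examples: List[ConflictExample],
    answer_fn: Callable[[ConflictExample], str],
) -> Dict[str, float]:
    """Fraction of questions answered with the label of the requested modality."""
    correct = {m: 0 for m in QUESTIONS}
    total = {m: 0 for m in QUESTIONS}
    for ex in examples:
        gold = ex.image_label if ex.target_modality == "image" else ex.caption_label
        total[ex.target_modality] += 1
        correct[ex.target_modality] += int(answer_fn(ex).strip().lower() == gold)
    return {m: correct[m] / total[m] for m in QUESTIONS if total[m]}


def image_biased_stub(ex: ConflictExample) -> str:
    """Hypothetical stand-in for a model that always reports the image content."""
    return ex.image_label


if __name__ == "__main__":
    data = build_conflicts(["dog", "cat"])
    print(accuracy_by_modality(data, image_biased_stub))
```

A large gap between the two per-modality accuracies is one simple way to quantify the kind of modality preference the abstract describes; the stub here is deliberately extreme so that the gap is easy to see.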

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Action Observation and Synchronization