How Do Vision-Language Models Process Conflicting Information Across Modalities?
Tianze Hua, Tian Yun, Ellie Pavlick

TL;DR
This study investigates how vision-language models handle conflicting multimodal inputs. It shows that they tend to favor one modality, traces that preference to internal mechanisms, and explores ways to control their responses for better consistency.
Contribution
The paper shows how vision-language models process conflicting information, identifies internal structures that influence modality preference, and identifies router heads that can be manipulated to control responses.
Findings
Models often favor one modality over the other.
Internal representations reflect modality preference.
Router heads can be manipulated to control responses.
Abstract
AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model,…
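The behavioral setup described in the abstract can be sketched in a few lines. This is a minimal illustration and not the authors' code: the names `ConflictExample`, `make_conflicts`, `modality_preference`, and `image_biased_stub` are hypothetical, the model is replaced by a stub function, and the word-matching rule for scoring answers is an assumption made for simplicity (labels are assumed to be single lowercase words).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class ConflictExample:
    """One inconsistent input: the image shows one class, the caption names another."""
    image_label: str    # what the image actually shows
    caption_label: str  # what the conflicting caption claims
    target: str         # modality the question asks about: "image" or "caption"

    @property
    def caption(self) -> str:
        return f"A photo of a {self.caption_label}"

    @property
    def question(self) -> str:
        if self.target == "image":
            return "What is in the image?"
        return "What does the caption say?"


def make_conflicts(labels: Sequence[str]) -> List[ConflictExample]:
    """Pair each label with every different label, once per target modality."""
    return [
        ConflictExample(img, cap, target)
        for img in labels
        for cap in labels
        if img != cap
        for target in ("image", "caption")
    ]


def modality_preference(
    examples: Sequence[ConflictExample],
    answer_fn: Callable[[ConflictExample], str],
) -> Dict[str, Dict[str, int]]:
    """Count, per requested modality, which modality's label the answer reports."""
    counts = {t: {"requested": 0, "other": 0, "unclear": 0} for t in ("image", "caption")}
    for ex in examples:
        # Assumed scoring rule: whole-word match of a label in the answer.
        words = set(re.findall(r"[a-z]+", answer_fn(ex).lower()))
        if ex.target == "image":
            requested, other = ex.image_label, ex.caption_label
        else:
            requested, other = ex.caption_label, ex.image_label
        has_requested, has_other = requested in words, other in words
        if has_requested and not has_other:
            counts[ex.target]["requested"] += 1
        elif has_other and not has_requested:
            counts[ex.target]["other"] += 1
        else:
            counts[ex.target]["unclear"] += 1
    return counts


def image_biased_stub(ex: ConflictExample) -> str:
    """Hypothetical stand-in for a model that always reports the image."""
    return f"It is a {ex.image_label}."


if __name__ == "__main__":
    examples = make_conflicts(["dog", "cat", "bird"])
    print(modality_preference(examples, image_biased_stub))
```

With a real vision-language model in place of the stub, a large "other" count for one target modality would indicate the kind of modality preference the abstract describes.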
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Action Observation and Synchronization
