Cross-modal Information Flow in Multimodal Large Language Models
Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova

TL;DR
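This paper investigates how visual and linguistic information interact inside multimodal large language models (MLLMs) during visual question answering. It finds that the two modalities are integrated in two distinct stages across the lower and middle layers, after which the combined representation is propagated to the final prediction.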
Contribution
The paper provides a detailed analysis of the internal information flow between vision and language in MLLMs, an area the authors describe as little studied, and identifies distinct processing stages.
Findings
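In the lower layers, general visual features of the image are transferred to the question token representations.
In the middle layers, object-specific visual information relevant to the question is integrated into the question token representations.
In the higher layers, the resulting multimodal representation is propagated to the final prediction.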
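A layer-wise flow like the one described in these findings is commonly probed by blocking attention between groups of tokens and observing how the prediction changes. The snippet below is a minimal sketch of that idea only, not the authors' code: the toy token layout, the function names `knockout_mask` and `attention_weights`, and the uniform attention scores are all assumptions made for illustration.

```python
import math

def knockout_mask(seq_len, source_positions, target_positions):
    """Build an additive causal mask that also blocks each target position
    from attending to any source position (diagonal entries are kept)."""
    neg_inf = float("-inf")
    # Standard causal mask: query q may attend to keys k <= q.
    mask = [[0.0 if k <= q else neg_inf for k in range(seq_len)]
            for q in range(seq_len)]
    for q in target_positions:
        for k in source_positions:
            if k != q:  # never block a token from attending to itself
                mask[q][k] = neg_inf
    return mask

def attention_weights(scores, mask):
    """Row-wise softmax over (scores + mask)."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        logits = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy layout: positions 0-2 are image patches,
# 3-4 are question tokens, 5 is the last (prediction) position.
image_positions = [0, 1, 2]
question_positions = [3, 4]
seq_len = 6

# Uniform raw scores, so any difference in the weights comes from the mask.
scores = [[0.0] * seq_len for _ in range(seq_len)]

mask = knockout_mask(seq_len, image_positions, question_positions)
weights = attention_weights(scores, mask)

# Question tokens now place zero weight on image patches,
# while the last position still attends to every earlier token.
print(weights[3])
print(weights[5])
```

In an actual model, a mask of this kind would be applied only within a chosen window of layers, and the change in the probability of the correct answer would indicate how much the blocked path matters at that depth.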
Abstract
The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
