How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning
Hongxuan Wu, Yukun Zhang, Xueqing Zhou

TL;DR
This paper introduces a layer-wise information-theoretic framework, based on Partial Information Decomposition (PID), for analyzing how multimodal Transformers process visual and linguistic information. It reveals a consistent modal transduction pattern whose dynamics depend on the task.
Contribution
It develops PID Flow, a novel pipeline for tractable high-dimensional information decomposition, and applies it to uncover the evolution of visual and linguistic information across Transformer layers.
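For readers unfamiliar with PID, the four components named in the abstract are tied together by the standard consistency equations below. The symbols are assumptions introduced here for illustration, not notation taken from the paper: $X_V^{(\ell)}$ and $X_L^{(\ell)}$ are the vision and language representations at layer $\ell$, $Y$ is the prediction target, and $R$, $U_V$, $U_L$, $S$ are the redundant, vision-unique, language-unique, and synergistic terms at that layer.

```latex
\begin{align}
I\bigl(Y; X_V^{(\ell)}, X_L^{(\ell)}\bigr) &= R + U_V + U_L + S, \\
I\bigl(Y; X_V^{(\ell)}\bigr) &= R + U_V, \\
I\bigl(Y; X_L^{(\ell)}\bigr) &= R + U_L.
\end{align}
```

These three equations leave one degree of freedom, so any concrete PID also needs a definition of redundancy (or of one of the other terms). The listing does not say which definition the paper adopts.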
Findings
Visual-unique information peaks early and decays with depth.
Language-unique information increases in late layers, accounting for ~82% of predictions.
Cross-modal synergy remains below 2%, indicating limited fusion at the information level.
Abstract
When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation, and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce PID Flow, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent modal transduction pattern: visual-unique information peaks early and decays with depth,…
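The last stage named in the abstract, closed-form Gaussian PID estimation, can be illustrated with a minimal sketch. It is not the paper's estimator. It assumes the features are already low-dimensional and jointly Gaussian, so it skips the dimensionality-reduction and normalizing-flow steps, and it assumes the minimum-mutual-information (MMI) definition of redundancy, which is one known closed form for Gaussian variables. The function names `gaussian_mi` and `mmi_pid` and the toy data are hypothetical.

```python
import numpy as np

def gaussian_mi(cov, idx_a, idx_b):
    """Mutual information (in nats) between two blocks of a jointly Gaussian vector."""
    idx_a, idx_b = list(idx_a), list(idx_b)
    idx_ab = idx_a + idx_b
    # I(A;B) = 0.5 * (log|S_A| + log|S_B| - log|S_AB|)
    _, ld_a = np.linalg.slogdet(cov[np.ix_(idx_a, idx_a)])
    _, ld_b = np.linalg.slogdet(cov[np.ix_(idx_b, idx_b)])
    _, ld_ab = np.linalg.slogdet(cov[np.ix_(idx_ab, idx_ab)])
    return 0.5 * (ld_a + ld_b - ld_ab)

def mmi_pid(cov, idx_v, idx_l, idx_y):
    """PID under the MMI redundancy assumption (an assumption of this sketch)."""
    i_v = gaussian_mi(cov, idx_v, idx_y)    # I(Y; X_V)
    i_l = gaussian_mi(cov, idx_l, idx_y)    # I(Y; X_L)
    i_vl = gaussian_mi(cov, list(idx_v) + list(idx_l), idx_y)  # I(Y; X_V, X_L)
    red = min(i_v, i_l)                     # MMI redundancy
    return {
        "redundant": red,
        "vision_unique": i_v - red,
        "language_unique": i_l - red,
        "synergy": i_vl - i_v - i_l + red,
        "total": i_vl,
    }

# Toy data (hypothetical): the target depends weakly on "vision" and strongly on "language".
rng = np.random.default_rng(0)
n = 20000
x_v = rng.normal(size=(n, 2))
x_l = rng.normal(size=(n, 2))
y = 0.2 * x_v[:, :1] + 1.0 * x_l[:, :1] + 0.5 * rng.normal(size=(n, 1))

cov = np.cov(np.hstack([x_v, x_l, y]), rowvar=False)
pid = mmi_pid(cov, idx_v=[0, 1], idx_l=[2, 3], idx_y=[4])
for name, value in pid.items():
    print(f"{name}: {value:.4f} nats")
```

Under the MMI assumption at least one of the two unique terms is always zero, so the weaker modality's information is counted entirely as redundant. Other redundancy definitions split the same total differently, which is why results like the shares reported in the findings depend on the estimator the paper actually uses.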
Taxonomy
Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language and cultural evolution
