How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

Hongxuan Wu; Yukun Zhang; Xueqing Zhou

arXiv:2602.15580·cs.AI·February 18, 2026

TL;DR

This paper introduces a layer-wise information-theoretic framework, based on Partial Information Decomposition (PID), for analyzing how multimodal Transformers process visual and linguistic information. It reveals a consistent modal transduction pattern whose dynamics depend on the task.

Contribution

The paper develops PID Flow, a pipeline that makes information decomposition tractable for high-dimensional representations, and applies it to trace how visual and linguistic information evolve across Transformer layers.

Findings

1. Visual-unique information peaks early and decays with depth.
2. Language-unique information increases in late layers, accounting for ~82% of predictions.
3. Cross-modal synergy remains below 2%, indicating limited fusion at the information level.

Abstract

When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation, and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce PID Flow, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent modal transduction pattern: visual-unique information peaks early and decays with depth,…
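To make the four-way decomposition concrete, the following is a minimal sketch of a closed-form Gaussian PID for one layer. It is not the authors' implementation. It assumes the features have already been reduced and Gaussianized (the dimensionality-reduction and normalizing-flow stages are skipped here), and it assumes the minimum-mutual-information (MMI) definition of redundancy, which the listing above does not confirm. The function names `gaussian_mi` and `gaussian_mmi_pid` are hypothetical.

```python
import numpy as np


def gaussian_mi(cov, a, b):
    """Mutual information I(A;B) in nats for jointly Gaussian variables.

    `cov` is the joint covariance matrix; `a` and `b` are index lists
    selecting the two groups of variables.
    """
    a, b = list(a), list(b)
    _, logdet_a = np.linalg.slogdet(cov[np.ix_(a, a)])
    _, logdet_b = np.linalg.slogdet(cov[np.ix_(b, b)])
    _, logdet_ab = np.linalg.slogdet(cov[np.ix_(a + b, a + b)])
    return 0.5 * (logdet_a + logdet_b - logdet_ab)


def gaussian_mmi_pid(vision, language, target):
    """Split I(target; vision, language) into four PID components.

    Inputs are arrays of shape (n_samples, n_features). Redundancy is taken
    as min(I(V;Y), I(L;Y)); this MMI choice is an assumption of the sketch.
    """
    data = np.hstack([vision, language, target])
    cov = np.cov(data, rowvar=False)
    dv, dl = vision.shape[1], language.shape[1]
    idx_v = list(range(dv))
    idx_l = list(range(dv, dv + dl))
    idx_y = list(range(dv + dl, data.shape[1]))

    i_v = gaussian_mi(cov, idx_v, idx_y)            # I(V;Y)
    i_l = gaussian_mi(cov, idx_l, idx_y)            # I(L;Y)
    i_vl = gaussian_mi(cov, idx_v + idx_l, idx_y)   # I(V,L;Y)

    redundant = min(i_v, i_l)
    return {
        "redundant": redundant,
        "vision_unique": i_v - redundant,
        "language_unique": i_l - redundant,
        "synergy": i_vl - i_v - i_l + redundant,
    }


# Synthetic stand-in for one layer's features: the target depends
# strongly on a language feature and weakly on a vision feature.
rng = np.random.default_rng(0)
vision = rng.normal(size=(5000, 2))
language = rng.normal(size=(5000, 2))
target = language[:, :1] + 0.1 * vision[:, :1] + 0.5 * rng.normal(size=(5000, 1))

atoms = gaussian_mmi_pid(vision, language, target)
total = sum(atoms.values())
shares = {name: value / total for name, value in atoms.items()}
```

Here `shares` gives each component as a fraction of the total predictive information, the same kind of quantity as the percentages reported in the findings. Repeating the call per layer would give a layer-wise profile. Under the MMI assumption, at most one of the two unique components is nonzero at any given layer.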

Taxonomy

Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language and cultural evolution