TL;DR
This paper introduces a novel information decomposition framework to analyze large vision-language models, revealing their internal decision processes and strategies beyond mere accuracy metrics.
Contribution
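It develops a scalable, model-agnostic pipeline that uses partial information decomposition (PID) to profile the information dynamics of large vision-language models (LVLMs) along three dimensions: breadth (across models and tasks), depth (across layers), and time (across training).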
Findings
Identifies two main task regimes: synergy-driven and knowledge-driven.
Discovers two contrasting family-level strategies: fusion-centric and language-centric.
Uncovers a three-phase pattern in layer-wise processing and highlights visual instruction tuning as key to learning fusion.
Abstract
Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine whether their success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and…
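To make the redundant / unique / synergistic split concrete, here is a minimal sketch on toy discrete distributions. It is not the paper's estimator: it assumes a fully known joint table p(x1, x2, y) and swaps in the simple minimum-mutual-information (MMI) definition of redundancy, chosen only because it is short. The function names `mutual_information` and `pid_mmi` are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(A;B) in bits from a 2-D joint table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, m)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

def pid_mmi(p):
    """Decompose I(X1,X2;Y) for a joint table p[x1, x2, y].

    Assumption: redundancy is min(I(X1;Y), I(X2;Y)) (the MMI measure),
    an illustrative stand-in, not the estimator used in the paper.
    """
    p = np.asarray(p, dtype=float)
    i1 = mutual_information(p.sum(axis=1))                 # I(X1;Y)
    i2 = mutual_information(p.sum(axis=0))                 # I(X2;Y)
    i12 = mutual_information(p.reshape(-1, p.shape[2]))    # I(X1,X2;Y)
    redundant = min(i1, i2)
    return {
        "redundant": redundant,
        "unique_1": i1 - redundant,
        "unique_2": i2 - redundant,
        "synergy": i12 - i1 - i2 + redundant,
    }

# Toy case 1: Y = X1 XOR X2. Neither input alone says anything about Y.
xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        xor[a, b, a ^ b] = 0.25

# Toy case 2: X1 = X2 = Y. Both inputs carry the same bit.
copy = np.zeros((2, 2, 2))
copy[0, 0, 0] = copy[1, 1, 1] = 0.5

xor_pid = pid_mmi(xor)
copy_pid = pid_mmi(copy)
```

In the XOR case all of the information lands in the synergy term, the analogue of a task that can only be solved by fusing both modalities; in the copy case it lands in the redundant term, since either input alone suffices.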