Phase Diagram of Vision Large Language Models Inference: A Perspective from Interaction across Image and Instruction
Houjing Wei, Yuting Shi, Naoya Inoue

TL;DR
This paper explores the internal interaction dynamics of Vision Large Language Models during inference, revealing a four-phase process of modality alignment, intra-modal encoding, inter-modal fusion, and output preparation.
Contribution
It introduces a novel framework for analyzing multimodal interactions in VLLMs, uncovering a four-phase inference dynamic across model layers.
Findings
Four-phase inference dynamics identified in VLLMs.
Early layers show modality alignment and intra-modal encoding.
Later layers exhibit inter-modal fusion and output preparation.
Abstract
Vision Large Language Models (VLLMs) usually take as input a concatenation of image token embeddings and text token embeddings and perform causal modeling over it. However, their internal behaviors remain underexplored, raising the question of how the two types of tokens interact. To investigate this multimodal interaction during inference, we measure the contextualization among the hidden state vectors of tokens from different modalities. Our experiments uncover four-phase inference dynamics of VLLMs across the depth of Transformer-based LMs: (I) Alignment: In very early layers, contextualization emerges between modalities, suggesting a feature space alignment. (II) Intra-modal Encoding: In early layers, intra-modal contextualization is enhanced while inter-modal interaction is suppressed, suggesting a local encoding within modalities. (III) Inter-modal…
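To make the idea of measuring contextualization between modalities concrete, here is a minimal sketch. It assumes mean pairwise cosine similarity between hidden states as a stand-in for contextualization; the abstract does not state the metric the authors actually use, so this is an illustration, not their method. The function names (`cosine`, `mean_pairwise_cosine`, `contextualization_profile`) and the toy hidden states are hypothetical.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(states, idx_a, idx_b):
    """Average cosine similarity over all token pairs (i, j), i in idx_a,
    j in idx_b, skipping pairs of a token with itself."""
    pairs = [(i, j) for i, j in product(idx_a, idx_b) if i != j]
    if not pairs:
        return 0.0
    return sum(cosine(states[i], states[j]) for i, j in pairs) / len(pairs)


def contextualization_profile(hidden_states, image_idx, text_idx):
    """Per-layer inter-modal and intra-modal similarity.

    hidden_states: one entry per layer, each a list of token vectors.
    image_idx / text_idx: token positions belonging to each modality.
    Assumption: cosine similarity is used as a proxy for contextualization.
    """
    profile = []
    for layer, states in enumerate(hidden_states):
        profile.append({
            "layer": layer,
            "inter": mean_pairwise_cosine(states, image_idx, text_idx),
            "intra_image": mean_pairwise_cosine(states, image_idx, image_idx),
            "intra_text": mean_pairwise_cosine(states, text_idx, text_idx),
        })
    return profile


# Toy hidden states (made up for illustration): 2 layers, 4 tokens, 2 dims.
toy_hidden_states = [
    # Layer 0: image tokens (0, 1) are orthogonal to text tokens (2, 3).
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    # Layer 1: all tokens share the same direction.
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
profile = contextualization_profile(toy_hidden_states, [0, 1], [2, 3])
for row in profile:
    print(row)
```

Plotting the `inter` and `intra_*` values against layer index would give a depth profile of the kind the abstract describes. With a real model, the per-layer hidden states would come from the model's forward pass rather than from toy lists.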
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
