Phase Diagram of Vision Large Language Models Inference: A Perspective from Interaction across Image and Instruction

Houjing Wei; Yuting Shi; Naoya Inoue

arXiv:2411.00646 · cs.CL · May 16, 2025

TL;DR

This paper examines how image and instruction tokens interact inside Vision Large Language Models during inference, and finds a four-phase process across layer depth: modality alignment, intra-modal encoding, inter-modal fusion, and output preparation.

Contribution

The paper introduces a framework for analyzing multimodal interaction in VLLMs by measuring contextualization among the hidden states of image and text tokens, and uses it to uncover four-phase inference dynamics across model layers.

Findings

01 Four-phase inference dynamics identified in VLLMs.

02 Early layers show modality alignment and intra-modal encoding.

03 Later layers exhibit inter-modal fusion and output preparation.

Abstract

Vision Large Language Models (VLLMs) typically take as input a concatenation of image token embeddings and text token embeddings and perform causal modeling over it. However, their internal behavior remains underexplored, which raises the question of how the two types of tokens interact. To investigate this multimodal interaction during inference, we measure the contextualization among the hidden state vectors of tokens from different modalities. Our experiments uncover four-phase inference dynamics across the depth of the Transformer-based LM: (I) Alignment: In very early layers, contextualization emerges between modalities, suggesting a feature space alignment. (II) Intra-modal Encoding: In early layers, intra-modal contextualization is enhanced while inter-modal interaction is suppressed, suggesting a local encoding within modalities. (III) Inter-modal…
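
The measurement idea in the abstract can be illustrated with a minimal sketch. The abstract as shown here does not specify the contextualization metric, so the code below assumes mean pairwise cosine similarity between hidden states as a stand-in; the function names (`mean_cosine`, `contextualization_by_layer`), the image-tokens-first layout, and the random input are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_cosine(a, b, exclude_self=False):
    """Mean pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T
    if exclude_self:
        # Only meaningful when a and b are the same set of tokens:
        # drop the diagonal so a token is not compared with itself.
        mask = ~np.eye(sims.shape[0], dtype=bool)
        return float(sims[mask].mean())
    return float(sims.mean())

def contextualization_by_layer(hidden_states, n_image_tokens):
    """Per-layer inter- and intra-modal similarity.

    hidden_states: array of shape (n_layers, seq_len, dim).
    Assumption: the first n_image_tokens positions are image tokens
    and the remaining positions are text tokens.
    """
    rows = []
    for layer in hidden_states:
        img, txt = layer[:n_image_tokens], layer[n_image_tokens:]
        rows.append({
            "inter": mean_cosine(img, txt),
            "intra_image": mean_cosine(img, img, exclude_self=True),
            "intra_text": mean_cosine(txt, txt, exclude_self=True),
        })
    return rows

# Placeholder input: random vectors standing in for real hidden states.
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 10, 16))  # 4 layers, 10 tokens, 16 dims
curves = contextualization_by_layer(states, n_image_tokens=6)
for depth, row in enumerate(curves):
    print(depth, row)
```

Plotting the three curves against layer depth would be one way to look for the kind of phase boundaries the abstract describes, for example a range of layers where the intra-modal curves rise while the inter-modal curve falls.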

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling