How Visual Representations Map to Language Feature Space in Multimodal LLMs

Constantin Venhoff; Ashkan Khakzar; Sonia Joseph; Philip Torr; Neel Nanda

arXiv:2506.11976 · cs.CV · June 24, 2025

TL;DR

This paper investigates how a trained linear adapter maps visual features into the representation space of a frozen LLM in multimodal large language models. Using the LLM's pre-trained sparse autoencoders as layer-wise probes, it finds that visual representations converge toward language features gradually, in the middle-to-late layers.

Contribution

It introduces an analysis framework in which a frozen LLM and a frozen ViT are connected only by a trained linear adapter, so that the LLM's pre-trained sparse autoencoders (SAEs) can serve as probes of layer-wise alignment and of the adapter's role in cross-modal mapping.

Findings

01

Visual representations gradually align with language features in middle-to-late layers.

02

Early LLM layers are misaligned with ViT outputs.

03

Layer-wise analysis reveals the progression of cross-modal feature integration.
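
One way such a layer-wise analysis could be set up is sketched below. It is a minimal illustration, not the paper's code: it assumes a standard ReLU sparse autoencoder, and the metrics (normalized reconstruction error and the number of active features), the function names, and the toy activations are all assumptions chosen for the example. The numbers it produces are synthetic and say nothing about the paper's results.

```python
def sae_encode_decode(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed standard ReLU SAE: f = relu(W_enc (x - b_dec) + b_enc), x_hat = W_dec f + b_dec."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    feats = [max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
             for row, b in zip(W_enc, b_enc)]
    x_hat = [b + sum(w * f for w, f in zip(row, feats))
             for row, b in zip(W_dec, b_dec)]
    return feats, x_hat


def probe_layer(activations, sae):
    """Return (mean normalized reconstruction error, mean number of active features).

    Assumes every activation vector is non-zero, so the normalizer is positive.
    """
    errors, active_counts = [], []
    for x in activations:
        feats, x_hat = sae_encode_decode(x, *sae)
        num = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        den = sum(a * a for a in x)
        errors.append(num / den)
        active_counts.append(sum(1 for f in feats if f > 0.0))
    n = len(activations)
    return sum(errors) / n, sum(active_counts) / n


# Hypothetical toy SAE: identity encoder and decoder with zero biases.
# In a real setting each LLM layer would have its own pre-trained SAE.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [0.0] * 4
toy_sae = (identity, zeros, identity, zeros)

# Synthetic activations at visual-token positions, keyed by a made-up layer name.
acts_by_layer = {
    "early": [[-1.0, 0.5, -2.0, 0.0]],  # mostly outside what the toy SAE can represent
    "late": [[1.0, 0.0, 2.0, 0.0]],     # fully representable by the toy SAE
}

report = {layer: probe_layer(acts, toy_sae) for layer, acts in acts_by_layer.items()}
for layer, (err, n_active) in report.items():
    print(f"{layer}: normalized error={err:.3f}, active features={n_active:.1f}")
```

Under these assumptions, a lower reconstruction error at a layer would suggest that the visual-token activations there lie closer to the feature dictionary the SAE learned from language data.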

Abstract

Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. Following the LiMBeR framework, we deliberately maintain a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual instruction tuning. By keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM's existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These…
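
The frozen-backbone setup described above can be illustrated with a toy sketch in which only the linear adapter receives gradient updates. Everything here is a stand-in assumed for illustration: `vit` is a fixed function, the frozen LLM is reduced to a fixed linear readout `U`, and a squared-error target replaces the actual visual instruction tuning objective. It is not the authors' training code.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def train_adapter(vit, llm_readout, data, d_vis, d_llm, lr=0.1, epochs=500, seed=0):
    """Fit a linear adapter A (d_llm x d_vis) by SGD; vit and llm_readout are never modified."""
    rng = random.Random(seed)
    A = [[rng.uniform(-0.1, 0.1) for _ in range(d_vis)] for _ in range(d_llm)]
    readout_T = transpose(llm_readout)
    for _ in range(epochs):
        for image, target in data:
            v = vit(image)                # frozen vision features
            h = matvec(A, v)              # adapter output, living in the LLM's input space
            out = matvec(llm_readout, h)  # frozen "LLM" (toy linear stand-in)
            resid = [o - t for o, t in zip(out, target)]
            # Gradient of 0.5 * ||out - target||^2 with respect to h, passed back through the frozen readout.
            grad_h = matvec(readout_T, resid)
            for i in range(d_llm):
                for j in range(d_vis):
                    A[i][j] -= lr * grad_h[i] * v[j]  # only the adapter is updated
    return A


# Hypothetical toy components.
vit = lambda image: list(image)        # frozen "ViT": identity features
U = [[1.0, 0.5], [0.0, 1.0]]           # frozen "LLM" readout
U_before = [row[:] for row in U]
A_true = [[2.0, 0.0], [0.0, -1.0]]     # adapter that generated the targets
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
data = [(img, matvec(U, matvec(A_true, vit(img)))) for img in images]

A = train_adapter(vit, U, data, d_vis=2, d_llm=2)
print("learned adapter:", [[round(a, 3) for a in row] for row in A])
```

Because the readout stays fixed, the adapter has to produce vectors the frozen model already handles correctly, which is the constraint the abstract relies on to justify probing with SAEs trained on the original LLM.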
