How Visual Representations Map to Language Feature Space in Multimodal LLMs
Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda

TL;DR
This paper investigates how visual features are mapped into language representations in multimodal large language models. It connects a frozen ViT to a frozen LLM through a trained linear adapter and uses the LLM's pre-trained sparse autoencoders (SAEs) as layer-wise probes, revealing gradual convergence in middle-to-late layers.
Contribution
It introduces an analysis framework that keeps both the LLM and the ViT frozen and probes the LLM with its pre-trained sparse autoencoders to trace layer-wise alignment in multimodal LLMs, highlighting the role of the linear adapter in cross-modal mapping.
Findings
Visual representations gradually align with language features in middle-to-late layers.
In early LLM layers, ViT outputs are misaligned with the LLM's language features.
Layer-wise analysis reveals the progression of cross-modal feature integration.
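A minimal sketch of what a layer-wise SAE probe could look like, assuming a standard ReLU sparse autoencoder and two illustrative statistics (relative reconstruction error and the number of active features). These metrics, the function names, and the toy weights are assumptions made for illustration, not the paper's confirmed procedure. In a real setting each layer would have its own pre-trained SAE; the toy below reuses one identity SAE to keep the arithmetic easy to follow.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """ReLU sparse autoencoder: return (latent features, reconstruction)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]
    x_hat = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, x_hat

def probe_stats(x, sae):
    """Relative squared reconstruction error and count of active features."""
    features, x_hat = sae_forward(x, *sae)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sq_norm = sum(a * a for a in x)
    return {"rel_error": sq_err / sq_norm,
            "l0": sum(1 for f in features if f > 0)}

# Hypothetical toy data: a 2-d identity SAE and one activation per layer.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_sae = (identity, [0.0, 0.0], identity, [0.0, 0.0])
activations_by_layer = {0: [1.0, -1.0], 1: [1.0, 2.0]}

stats_by_layer = {layer: probe_stats(x, toy_sae)
                  for layer, x in activations_by_layer.items()}
```

Under this reading, a visual-token activation that the SAE reconstructs poorly (layer 0 in the toy data) lies off the language feature basis, while one it reconstructs well (layer 1) is aligned with it.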
Abstract
Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. Following the LiMBeR framework, we deliberately maintain a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual instruction tuning. By keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM's existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These…
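The setup described in the abstract can be sketched roughly as follows: a single affine map takes each ViT patch embedding into the LLM's embedding dimension, and the resulting visual tokens are placed in front of the text embeddings. The class name, the dimensions, the initialization scale, and the choice to prepend visual tokens are all assumptions for illustration. An actual implementation would use a tensor library and train only the adapter's weights while the ViT and LLM stay frozen.

```python
import random

class LinearAdapter:
    """Hypothetical linear adapter: one affine map from ViT space to LLM space."""

    def __init__(self, d_vit, d_llm, seed=0):
        rng = random.Random(seed)
        # The only trainable parameters in this setup (ViT and LLM are frozen).
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_vit)]
                  for _ in range(d_llm)]
        self.b = [0.0] * d_llm

    def __call__(self, patches):
        """Map each patch embedding (length d_vit) to a token of length d_llm."""
        return [[sum(w * p for w, p in zip(row, patch)) + b
                 for row, b in zip(self.W, self.b)]
                for patch in patches]

def build_llm_input(visual_tokens, text_embeddings):
    """Prepend adapted visual tokens to the text embeddings (an assumed layout)."""
    return visual_tokens + text_embeddings

# Toy sizes, chosen only to keep the example small.
d_vit, d_llm = 4, 6
adapter = LinearAdapter(d_vit, d_llm)
patches = [[0.1 * (i + j) for j in range(d_vit)] for i in range(3)]
text = [[0.0] * d_llm for _ in range(2)]

sequence = build_llm_input(adapter(patches), text)
```

Because the LLM never changes, anything it understands about the image has to arrive through `W` and `b`, which is the reason the adapter's outputs can be inspected with probes trained on text-only activations.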
