Representations of Text and Images Align From Layer One
Evžen Wybitul, Javier Rando, Florian Tramèr, Stanislav Fort

TL;DR
This paper shows that image and text representations in adapter-based vision-language models are meaningfully aligned from the very first layer. It uses a new synthesis-based method, inspired by DeepDream, to visualize this alignment for hundreds of concepts across seven layers of Gemma 3.
Contribution
The authors introduce a simple, fast synthesis-based approach to directly visualize and confirm early-layer image-text alignment in vision-language models, challenging the established view that such alignment only appears in late layers.
Findings
At layer 1, more than 50% of synthesized images depict recognizable features of animals, activities, or seasons
Method is applied to hundreds of concepts across seven layers of Gemma 3
Provides a new way to interpret model representations
Abstract
We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as "Jupiter", we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50 % of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct,…
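The synthesis step described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the "encoder" here is a hypothetical linear map `W` standing in for the model up to a chosen layer, the "image" is a flat list of four numbers, and the gradient is written out by hand rather than obtained from an autodiff library. The names `matvec`, `cosine`, and `synthesise`, the unit-norm constraint on the image, and all values below are assumptions made for illustration.

```python
import math

def matvec(W, x):
    """Apply the toy linear encoder W (a list of rows) to a flat image x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(a, b):
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))

def synthesise(W, concept, start, steps=50, lr=0.5):
    """Gradient ascent on cosine(W x, concept), keeping x at unit norm."""
    x = list(start)
    for _ in range(steps):
        h = matvec(W, x)
        nh, nc = norm(h), norm(concept)
        cos = cosine(h, concept)
        # Gradient of the cosine similarity with respect to the representation h.
        g_h = [c / (nh * nc) - cos * hi / (nh * nh) for c, hi in zip(concept, h)]
        # Chain rule through the linear encoder: g_x = W^T g_h.
        g_x = [sum(W[i][j] * g_h[i] for i in range(len(W))) for j in range(len(x))]
        x = [xi + lr * gi for xi, gi in zip(x, g_x)]
        # Project back onto the unit sphere so the image cannot grow without bound.
        scale = norm(x)
        x = [xi / scale for xi in x]
    return x

# Hypothetical toy setup: a 2-dimensional representation of a 4-pixel image.
W = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]
concept = [-1.0, 2.0]          # stand-in for a text-derived concept vector
start = [1.0, 0.0, 0.0, 0.0]   # stand-in for the initial image
image = synthesise(W, concept, start)
print(cosine(matvec(W, start), concept), cosine(matvec(W, image), concept))
```

In this toy setup the cosine similarity between the image representation and the concept vector starts at roughly 0.32 and should climb towards 1. With a real vision-language model the same loop would presumably be driven by automatic differentiation through the vision encoder and adapter, and the resulting image would then be inspected for recognisable features of the concept.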
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Language and cultural evolution
