Line of Sight: On Linear Representations in VLLMs
Achyuta Rajaram, Sarah Schwettmann, Jacob Andreas, and Arthur Conmy

TL;DR
This paper investigates how visual concepts are represented in large language models with multimodal capabilities, revealing linearly decodable features and their causal influence, and introduces multimodal Sparse Autoencoders for interpretability.
Contribution
It demonstrates the presence of linearly decodable image features in VLLMs and introduces multimodal Sparse Autoencoders to enhance interpretability of visual representations.
Findings
ImageNet classes are represented via linearly decodable features.
Features are causal, as shown by targeted edits affecting outputs.
Text and image representations are largely disjoint, but become increasingly shared in deeper layers.
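The first two findings can be illustrated with a minimal sketch on synthetic data. This is not the paper's code: the activations are random vectors standing in for residual-stream activations, the probe is a plain least-squares classifier, and the "edit" projects out the probe direction and checks the effect on the probe itself, not on a real model's output. All function names here are hypothetical.

```python
import numpy as np

def make_synthetic_activations(n=400, d=64, seed=0):
    """Synthetic stand-in for residual-stream activations (an assumption).

    Class 1 examples are shifted along one fixed unit direction, so the
    class is linearly decodable by construction.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d)) + 4.0 * labels[:, None] * direction
    return acts, labels, direction

def fit_linear_probe(acts, labels):
    """Least-squares linear probe with a bias term; targets are -1 / +1."""
    design = np.hstack([acts, np.ones((acts.shape[0], 1))])
    targets = 2.0 * labels - 1.0
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef[:-1], coef[-1]

def probe_accuracy(acts, labels, weight, bias):
    """Fraction of examples the probe classifies correctly."""
    preds = (acts @ weight + bias > 0).astype(int)
    return float((preds == labels).mean())

def ablate_direction(acts, direction):
    """Remove the component of each activation along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

acts, labels, true_dir = make_synthetic_activations()
train, test = slice(0, 300), slice(300, 400)

weight, bias = fit_linear_probe(acts[train], labels[train])
acc_before = probe_accuracy(acts[test], labels[test], weight, bias)

# Targeted edit: project out the probe direction, then re-run the probe.
edited = ablate_direction(acts[test], weight)
acc_after = probe_accuracy(edited, labels[test], weight, bias)

print(f"probe accuracy before edit: {acc_before:.2f}")
print(f"probe accuracy after edit:  {acc_after:.2f}")
```

In a real experiment the activations would come from a chosen layer of the model, and the edited activations would be written back into the forward pass so that the change in generated text can be observed.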
Abstract
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-NeXT, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
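For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the forward pass and training objective of a common SAE variant: a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The abstract does not specify the architecture, loss, or hyperparameters used in the paper, so the dictionary size, the `l1_coeff` value, and all names below are assumptions for illustration only.

```python
import numpy as np

def init_sae(d_model, d_dict, seed=0):
    """Initialise a hypothetical SAE with unit-norm decoder rows."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(size=(d_dict, d_model))
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": w_dec.T.copy(),      # (d_model, d_dict), tied at init only
        "b_enc": np.zeros(d_dict),
        "w_dec": w_dec,               # (d_dict, d_model)
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations into non-negative features, then reconstruct."""
    pre_acts = (x - params["b_dec"]) @ params["w_enc"] + params["b_enc"]
    features = np.maximum(0.0, pre_acts)          # ReLU gives sparse codes
    recon = features @ params["w_dec"] + params["b_dec"]
    return features, recon

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on feature activations."""
    features, recon = sae_forward(params, x)
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity

# Random vectors stand in for a batch of text and image token activations.
rng = np.random.default_rng(1)
acts = rng.normal(size=(8, 16))
params = init_sae(d_model=16, d_dict=64)

features, recon = sae_forward(params, acts)
loss = sae_loss(params, acts)
print(features.shape, recon.shape)
```

Each decoder row in `w_dec` is one dictionary direction in activation space. In a multimodal setting, the same dictionary would be trained on activations from both text and image tokens, which makes it possible to ask whether a given feature fires on one modality or both.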
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
Methods: Sparse Evolutionary Training
