Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo; Luke Ong; Philip Torr; Mor Geva; David Krueger; Fazl Barez

arXiv:2410.07149 · cs.CV · April 29, 2025 · 2 cites

TL;DR

This paper investigates how the language model component of LLaVA processes visual tokens: where object information is localized, how visual token representations evolve across layers, and how visual information is integrated at prediction time.

Contribution

It provides the first detailed analysis of visual token processing in VLMs, highlighting interpretability, object localization, and information integration mechanisms.

Findings

01. Object identification accuracy drops by over 70% when object-specific tokens are removed.

02. Visual token representations become increasingly interpretable in the vocabulary space across layers.

03. The model extracts object information from these refined representations at the last token position for prediction.
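
The ablation behind finding 01 can be illustrated with a minimal sketch. This is not the authors' code: the function name `ablate_tokens`, the boolean `object_mask`, and the choice to overwrite ablated positions with the mean of all visual tokens in the image are assumptions made for illustration, since this page does not state what replaces the removed tokens.

```python
import numpy as np


def ablate_tokens(visual_tokens, object_mask):
    """Return a copy of visual_tokens with the masked positions overwritten.

    visual_tokens: array of shape (num_tokens, d_model).
    object_mask:   boolean array of shape (num_tokens,), True where a token
                   is assumed to cover the object (hypothetical input).

    Assumption: ablated positions are replaced by the mean of all visual
    tokens in the image. The replacement used in the paper may differ.
    """
    tokens = np.asarray(visual_tokens, dtype=float)
    mask = np.asarray(object_mask, dtype=bool)
    if mask.shape != tokens.shape[:1]:
        raise ValueError("object_mask must have one entry per token")

    replacement = tokens.mean(axis=0)  # (d_model,)
    ablated = tokens.copy()            # leave the input untouched
    ablated[mask] = replacement
    return ablated


# Toy example: 4 visual tokens of width 3, tokens 1 and 3 cover the object.
tokens = np.arange(12, dtype=float).reshape(4, 3)
mask = np.array([False, True, False, True])
ablated = ablate_tokens(tokens, mask)
```

In an actual experiment the ablated tokens would be fed to the language model in place of the originals, and object identification accuracy would be compared before and after.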

Abstract

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions. Through ablation studies, we demonstrated that object identification accuracy drops by over 70% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in…
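
One common way to read a hidden state "in the vocabulary space" is to project it through the model's unembedding matrix and look at the highest-scoring tokens. The sketch below shows that projection on toy data. It is an illustration only: the name `project_to_vocab`, the toy matrix, and the three-word vocabulary are hypothetical, the final normalization layer that real models apply before unembedding is omitted, and the abstract does not confirm that this exact procedure is the one used.

```python
import numpy as np


def project_to_vocab(hidden_state, unembedding, vocab, top_k=3):
    """Return the top_k vocabulary entries for one hidden state.

    hidden_state: array of shape (d_model,), e.g. a visual token's
                  representation at some intermediate layer.
    unembedding:  array of shape (d_model, vocab_size).
    vocab:        list of vocab_size token strings.

    Assumption: the final layer norm is skipped for brevity.
    """
    logits = np.asarray(hidden_state, dtype=float) @ np.asarray(unembedding)
    top = np.argsort(logits)[::-1][:top_k]  # indices, highest logit first
    return [vocab[i] for i in top]


# Toy setup: 3-dimensional hidden state, identity unembedding.
vocab = ["dog", "cat", "car"]
unembedding = np.eye(3)
hidden = np.array([0.1, 2.0, 0.5])
top_tokens = project_to_vocab(hidden, unembedding, vocab)
```

Applying the same projection to one visual token at every layer gives a per-layer list of nearest vocabulary entries, which is one way to check whether representations drift toward words describing the image content.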

Code & Models

Repositories

clemneo/llava-interp
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications