Mechanisms of Object Localization in Vision-Language Models
Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig

TL;DR
This paper investigates how vision-language models localize objects. It finds that localization is driven by a containerization mechanism, in which object-aligned tokens define the object's spatial extent, and that a small set of specialized attention heads mediates the effect.
Contribution
It offers the first detailed layer- and head-level analysis of object localization mechanisms in vision-language models, highlighting specialized pathways and attention head roles.
Findings
Localization driven by containerization with object-aligned tokens
Few attention heads mediate causal effects, concentrated in specific layers
Localization and classification rely on largely distinct specialized heads
Abstract
Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization,…
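Attention knockout, one of the tools named in the abstract, can be illustrated with a toy example. The sketch below is not the authors' code: it uses a single attention head in NumPy with random weights, no causal mask, and a hypothetical token layout (four image tokens followed by two text tokens, with image tokens 1 and 2 assumed to be "object-aligned"). It cuts the attention edges from the last text token to those two image tokens by setting their pre-softmax scores to negative infinity, then compares the result with the unmodified head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0, so blocked edges get zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, blocked=None):
    """Single-head scaled dot-product attention with optional knockout.

    blocked: boolean array of shape [n_query, n_key]; True marks an edge to cut.
    Each query must keep at least one unblocked key, otherwise the row is undefined.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if blocked is not None:
        scores = np.where(blocked, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical layout: tokens 0-3 are image patches, tokens 4-5 are text.
rng = np.random.default_rng(0)
n_tokens, d_head = 6, 8
q = rng.normal(size=(n_tokens, d_head))
k = rng.normal(size=(n_tokens, d_head))
v = rng.normal(size=(n_tokens, d_head))

# Assumption for illustration: image tokens 1 and 2 are the object-aligned ones.
object_tokens = [1, 2]
last_text_token = 5

blocked = np.zeros((n_tokens, n_tokens), dtype=bool)
blocked[last_text_token, object_tokens] = True

out_base, w_base = attention(q, k, v)
out_ko, w_ko = attention(q, k, v, blocked=blocked)

# How much the last token's output moves once it can no longer read the object tokens.
effect = np.linalg.norm(out_ko[last_text_token] - out_base[last_text_token])
print("change in last-token output:", effect)
```

In a real model the same mask would be applied inside selected layers or heads, and the effect would be measured on the predicted box or class rather than on a raw head output.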
