Mechanisms of Object Localization in Vision-Language Models

Timothy Schaumlöffel; Martina G. Vilas; Gemma Roig

arXiv:2605.19792·cs.CV·May 20, 2026

TL;DR

This paper investigates how vision-language models localize objects. It finds that localization is driven by a containerization mechanism, in which object-aligned tokens define the object's spatial extent, and that a small set of attention heads mediates the causal effect.

Contribution

The paper offers the first detailed layer- and head-level analysis of object localization mechanisms in vision-language models, highlighting specialized pathways and the roles of individual attention heads.

Findings

01 Localization is driven by containerization with object-aligned tokens

02 A few attention heads mediate the causal effects, concentrated in specific layers

03 Localization and classification rely on largely distinct specialized heads

Abstract

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization,…
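The abstract lists attention knockout among the analysis tools. As a rough illustration of the general technique only (not the authors' code, and not run on LLaVA-1.5 or InternVL-3.5), the sketch below cuts selected attention edges in a toy single-head attention layer by setting their pre-softmax scores to negative infinity, then compares the output with the unmodified run. The token vectors, the choice of which positions count as "image" tokens, and the function names are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, blocked=frozenset()):
    """Toy single-head scaled dot-product attention with optional knockout.

    q, k, v: lists of equal-length float vectors, one per token.
    blocked: set of (query_idx, key_idx) pairs whose attention edge is cut
             by forcing the pre-softmax score to -inf.
    Each query must keep at least one unblocked key.
    Returns (outputs, weights).
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if (i, j) in blocked:
                scores.append(-math.inf)
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights


# Hypothetical toy sequence: positions 0 and 1 stand in for image tokens,
# position 3 stands in for the position where the answer is read out.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

clean_out, clean_w = attention(tokens, tokens, tokens)

# Knock out the edges from the read-out position to both "image" tokens.
knockout = frozenset({(3, 0), (3, 1)})
ko_out, ko_w = attention(tokens, tokens, tokens, blocked=knockout)

# Size of the change at the read-out position caused by the knockout.
effect = math.dist(clean_out[3], ko_out[3])
print("change at read-out position:", effect)
```

In an actual model the same intervention would be applied per layer and per head, and the effect would be measured on the predicted class or bounding box rather than on a raw vector distance.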
