Investigating Mechanisms for In-Context Vision Language Binding

Darshana Saravanan; Makarand Tapaswi; Vineet Gandhi

arXiv:2505.22200 · cs.CV · May 29, 2025

Open Access

TL;DR

This paper explores how vision-language models associate objects in images with textual descriptions by analyzing the Binding ID mechanism, revealing that models assign consistent identifiers to linked image and text tokens.

Contribution

It extends the Binding ID concept from language models to vision-language models, demonstrating how VLMs form associations between image objects and text references.

Findings

01. VLMs assign consistent Binding IDs to image objects and their textual descriptions.

02. Binding IDs enable effective in-context association between visual and textual information.

03. The study uses a synthetic dataset and task, built around 3D objects in an image and their descriptions in the text, to analyze binding mechanisms in VLMs.

Abstract

To understand a prompt, Vision-Language Models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image with phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.
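
The idea of tokens "sharing a Binding ID" can be made concrete with a toy model. The sketch below is an illustration only, not the authors' code or experimental setup: it assumes each activation is the sum of a content vector and a binding vector, and that all of these vectors are mutually orthogonal. The token names (`image:car`, `text:item_A`, and so on), the dimension, and the helper functions are hypothetical.

```python
import numpy as np

DIM = 16
rng = np.random.default_rng(0)

# Orthonormal basis, so content and binding directions do not interfere.
# (Assumption made for clarity; real model activations are not this clean.)
basis, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))

# Hypothetical content vectors: what each token is "about".
content = {
    "image:car": basis[:, 0],
    "image:ball": basis[:, 1],
    "text:item_A": basis[:, 2],
    "text:item_B": basis[:, 3],
}

# Hypothetical Binding ID vectors, one per object in the context.
binding_id = [basis[:, 4], basis[:, 5]]


def activation(token, bid):
    """Toy activation: content vector plus the Binding ID vector it carries."""
    return content[token] + binding_id[bid]


def bound_image_token(text_act, image_acts):
    """Return the image token whose activation overlaps most with text_act."""
    return max(image_acts, key=lambda name: float(image_acts[name] @ text_act))


def associate(text_bids):
    """Map each text reference to the image object sharing its Binding ID."""
    image_acts = {
        "image:car": activation("image:car", 0),
        "image:ball": activation("image:ball", 1),
    }
    return {
        token: bound_image_token(activation(token, bid), image_acts)
        for token, bid in text_bids.items()
    }


# Text references carry the same Binding ID as the object they describe.
matches = associate({"text:item_A": 0, "text:item_B": 1})

# Swapping the Binding IDs on the text side swaps the association.
swapped = associate({"text:item_A": 1, "text:item_B": 0})

print(matches)
print(swapped)
```

In this toy setting the only component shared between an image token and a text token is the binding vector, so the dot product picks out the pair with the same Binding ID, and exchanging the binding vectors exchanges the pairing. Whether and how cleanly this holds in an actual VLM is what the paper investigates.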

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism