Investigating Mechanisms for In-Context Vision Language Binding
Darshana Saravanan, Makarand Tapaswi, Vineet Gandhi

TL;DR
This paper explores how vision-language models associate objects in images with textual descriptions by analyzing the Binding ID mechanism, revealing that models assign consistent identifiers to linked image and text tokens.
Contribution
It extends the Binding ID concept from language models to vision-language models, demonstrating how VLMs form associations between image objects and text references.
Findings
VLMs assign a distinct Binding ID that is shared by an object's image tokens and its textual references.
Binding IDs enable effective in-context association between visual and textual information.
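The study uses a synthetic dataset and task, in which models must associate 3D objects in an image with their descriptions in the text, to analyze binding mechanisms in VLMs.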
Abstract
To understand a prompt, Vision-Language Models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image with phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.
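The Binding ID idea can be illustrated with a toy sketch. This is not the paper's method or any real VLM's internals. It assumes, purely for illustration, that each token activation is a content vector plus a binding vector, and that bound tokens share the same binding vector. All names (`content`, `binding`, `build_activations`, `bound_item`) and the residual-matching readout are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical content vectors: what each token represents.
content = {
    name: rng.normal(size=dim)
    for name in ("obj_0", "obj_1", "item_0", "item_1")
}
# Hypothetical binding vectors: two orthogonal "IDs" shared by bound tokens.
binding = np.eye(dim)[:2]


def build_activations(obj_ids=(0, 1), item_ids=(0, 1)):
    """Toy activations: content vector plus the binding vector for the given ID."""
    acts = {}
    for k, (obj_id, item_id) in enumerate(zip(obj_ids, item_ids)):
        acts[f"obj_{k}"] = content[f"obj_{k}"] + binding[obj_id]
        acts[f"item_{k}"] = content[f"item_{k}"] + binding[item_id]
    return acts


def bound_item(acts, obj_name):
    """Toy readout (an assumption): pick the item whose binding residual
    best matches the object's binding residual."""
    obj_residual = acts[obj_name] - content[obj_name]
    scores = {
        name: float(obj_residual @ (acts[name] - content[name]))
        for name in ("item_0", "item_1")
    }
    return max(scores, key=scores.get)


baseline = build_activations()
# Intervention: swap the binding IDs carried by the two objects' tokens.
swapped = build_activations(obj_ids=(1, 0))

print(bound_item(baseline, "obj_0"))
print(bound_item(swapped, "obj_0"))
```

In this toy setup, swapping the objects' binding vectors flips which item each object is associated with, while the content vectors stay untouched. That is the kind of behavior one would expect if a shared identifier, rather than content, carries the association.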
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism
