Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle, Shimon Ullman, M. Jehanzeb Mirza

TL;DR
This paper introduces a data-centric fine-tuning approach to improve vision-language models' ability to localize specific objects in images based on few-shot examples, emphasizing personalized, context-aware localization.
Contribution
It presents a novel fine-tuning method using video object tracking data and a regularization technique to enhance few-shot personalized object localization in VLMs.
Findings
Significant performance improvement in few-shot localization tasks.
Enhanced context awareness in VLMs without losing generalization.
First benchmark for personalized few-shot localization in VLMs.
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity of several related objects that can respond to a text or an object that is hard to describe with words. To provoke…
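The task format described in the abstract (a few annotated in-context images, each with a category label and a bounding box, followed by a query image) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Shot` record, the `<image:...>` placeholder syntax, the `(x_min, y_min, x_max, y_max)` box convention, and the use of intersection-over-union (IoU) as a way to score a predicted box.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Shot:
    """One in-context example: an image reference, a label, and a box."""
    image_id: str  # hypothetical identifier standing in for the image itself
    label: str
    box: Box


def build_prompt(shots: List[Shot], query_image_id: str) -> str:
    """Lay out the annotated shots, then ask for the box in the query image.

    The text layout is illustrative only; real VLMs each define their own
    image-token and coordinate formats.
    """
    if not shots:
        raise ValueError("at least one in-context example is required")
    lines = []
    for shot in shots:
        x0, y0, x1, y1 = shot.box
        lines.append(
            f"<image:{shot.image_id}> {shot.label}: [{x0:g}, {y0:g}, {x1:g}, {y1:g}]"
        )
    # The query reuses the label of the in-context examples.
    lines.append(f"<image:{query_image_id}> {shots[0].label}: ?")
    return "\n".join(lines)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    shots = [
        Shot("frame_001", "my mug", (10, 20, 110, 140)),
        Shot("frame_002", "my mug", (200, 35, 290, 150)),
    ]
    print(build_prompt(shots, "frame_003"))
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A predicted box parsed from the model's answer could then be compared against the ground-truth box with `iou`; the threshold at which a prediction counts as correct is a design choice this excerpt does not specify.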
Taxonomy
Topics: Semantic Web and Ontologies · Robot Manipulation and Learning · AI-based Problem Solving and Planning
Methods: Sparse Evolutionary Training · Focus
