Teaching VLMs to Localize Specific Objects from In-context Examples

Sivan Doveh; Nimrod Shabtay; Wei Lin; Eli Schwartz; Hilde Kuehne; Raja Giryes; Rogerio Feris; Leonid Karlinsky; James Glass; Assaf Arbelle; Shimon Ullman; M. Jehanzeb Mirza

arXiv:2411.13317 · cs.CV · March 14, 2025

TL;DR

This paper introduces a data-centric fine-tuning approach to improve vision-language models' ability to localize specific objects in images based on few-shot examples, emphasizing personalized, context-aware localization.

Contribution

It presents a novel fine-tuning method using video object tracking data and a regularization technique to enhance few-shot personalized object localization in VLMs.
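
As a rough illustration of how video object tracking data could supply few-shot localization examples, the sketch below draws support frames and one query frame from a single annotated track. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Frame` type, the `make_episode` helper, and the pixel box layout are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels. Hypothetical, not from the paper.
Box = Tuple[int, int, int, int]


@dataclass
class Frame:
    """One annotated video frame: where the image lives and where the object is."""
    image_path: str
    box: Box


def make_episode(track: List[Frame], n_shots: int, rng: random.Random):
    """Draw n_shots support frames and one query frame from the same track.

    All frames show the same tracked object, so the support frames act as
    in-context examples and the query frame's box is the target.
    """
    if len(track) < n_shots + 1:
        raise ValueError("track needs at least n_shots + 1 frames")
    picked = rng.sample(track, n_shots + 1)  # sampling without replacement
    return picked[:-1], picked[-1]


# Usage: a toy track of five frames, two shots plus one query.
track = [Frame(f"frame_{i}.jpg", (i, i, i + 10, i + 10)) for i in range(5)]
support, query = make_episode(track, n_shots=2, rng=random.Random(0))
```

Sampling without replacement keeps the query frame out of the support set, so the target box is never shown to the model as an example.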

Findings

1. Significant performance improvement in few-shot localization tasks.

2. Enhanced context awareness in VLMs without losing generalization.

3. First benchmark for personalized few-shot localization in VLMs.

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA), when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples), each with a category label and bounding box, and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important when several related objects could match a text description, or when an object is hard to describe in words. To provoke…
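
The task described in the abstract (annotated examples with a label and bounding box, followed by a query image) can be pictured with a small sketch. The prompt layout, the `<image>` placeholder, and the `build_prompt` helper below are assumptions made for illustration and are not taken from the paper; intersection-over-union (IoU) is shown only as a common way to score a predicted box against a ground-truth box.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2). Hypothetical, not from the paper.
Box = Tuple[float, float, float, float]


def build_prompt(shots: List[Tuple[str, Box]], query_label: str) -> str:
    """Format in-context examples followed by an open query line.

    "<image>" is a hypothetical placeholder for where an image would be fed in.
    """
    lines = []
    for label, (x1, y1, x2, y2) in shots:
        lines.append(f"<image> {label}: [{x1}, {y1}, {x2}, {y2}]")
    lines.append(f"<image> {query_label}:")  # the model would complete the box
    return "\n".join(lines)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Usage: two example boxes for the same object, then a query.
prompt = build_prompt([("mug", (1, 2, 3, 4)), ("mug", (5, 6, 7, 8))], "mug")
score = iou((0, 0, 10, 10), (5, 5, 15, 15))
```

In this sketch a prediction would be counted as correct when its IoU with the ground-truth box exceeds a chosen threshold; the threshold value is left open here.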

Code & Models

Repositories

sivandoveh/iploc (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies · Robot Manipulation and Learning · AI-based Problem Solving and Planning

Methods: Sparse Evolutionary Training · Focus