Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
Hila Levi, Guy Heller, Dan Levi, Ethan Fetaya

TL;DR
This paper introduces a scalable object-centric image retrieval method that aggregates dense CLIP embeddings, significantly improving accuracy over global feature approaches and enabling efficient large-scale retrieval.
Contribution
The authors propose a novel aggregation of dense CLIP embeddings for object-centric retrieval, balancing scalability and detailed object identification.
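As a rough illustration of why aggregated dense embeddings can help with small objects, the sketch below compresses a grid of per-patch embeddings into a few representative vectors and scores an image by its best-matching vector. This summary does not state how the authors aggregate, so the spherical k-means step, the farthest-point initialization, and every function name here (`aggregate_dense`, `score_image`, `global_score`) are assumptions made for illustration, with synthetic vectors standing in for real CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def aggregate_dense(dense, k, iters=10):
    """Compress (n_patches, dim) dense embeddings into k unit-norm vectors.

    Assumption: spherical k-means with farthest-point initialization; the
    paper's actual aggregation scheme may differ.
    """
    dense = l2_normalize(dense)
    # Farthest-point init: start from patch 0, then repeatedly add the patch
    # least similar to the centers chosen so far.
    chosen = [dense[0]]
    for _ in range(1, k):
        sim_to_chosen = np.max(dense @ np.stack(chosen).T, axis=1)
        chosen.append(dense[np.argmin(sim_to_chosen)])
    centers = np.stack(chosen)
    for _ in range(iters):
        assign = np.argmax(dense @ centers.T, axis=1)
        for j in range(k):
            members = dense[assign == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return centers


def score_image(query, aggregated):
    """Score an image by its best-matching aggregated vector."""
    return float(np.max(aggregated @ l2_normalize(query)))


def global_score(query, dense):
    """Baseline: one global vector per image (mean of all patch embeddings)."""
    g = l2_normalize(l2_normalize(dense).mean(axis=0))
    return float(g @ l2_normalize(query))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    background = np.eye(dim)[0] + 0.01 * rng.standard_normal((60, dim))
    small_object = np.eye(dim)[1] + 0.01 * rng.standard_normal((4, dim))
    dense = np.vstack([background, small_object])
    query = np.eye(dim)[1]  # stands in for a text embedding of the object
    print("aggregated:", score_image(query, aggregate_dense(dense, k=2)))
    print("global:    ", global_score(query, dense))
```

In this toy setup the object occupies only 4 of 64 patches, so its signal is diluted in the single global vector but survives as its own aggregated vector. Storing k vectors per image rather than one per patch is what keeps the index size manageable in this sketch.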
Findings
Improves retrieval accuracy by up to 15 mAP points over global-feature approaches.
Effectively combines scalability with object-level retrieval capabilities.
Demonstrates advantages in large-scale retrieval frameworks.
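For readers unfamiliar with the metric, mean average precision (mAP) averages, over all queries, the precision measured at each rank where a relevant image appears. The sketch below uses the standard textbook definition and assumes each ranked list covers every relevant image for its query; the paper's exact evaluation protocol is not given in this summary, and the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@rank over the ranks of relevant items.

    ranked_relevance is a list of 0/1 flags in ranked order (best match first).
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_query_relevance):
    """mAP: the mean of per-query AP values."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Query 1 finds relevant images at ranks 1 and 3; query 2 at rank 2.
    print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```

mAP is usually reported on a 0 to 100 scale, so a gain of 15 mAP points corresponds to, for example, moving from 40 to 55.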
Abstract
The task of open-vocabulary object-centric image retrieval involves the retrieval of images containing a specified object of interest, delineated by an open-set text query. As working on large image datasets becomes standard, solving this task efficiently has gained significant practical importance. Applications include targeted performance analysis of retrieved images using ad-hoc queries and hard example mining during training. Recent advancements in contrastive-based open vocabulary systems have yielded remarkable breakthroughs, facilitating large-scale open vocabulary image retrieval. However, these approaches use a single global embedding per image, thereby constraining the system's ability to retrieve images containing relatively small object instances. Alternatively, incorporating local embeddings from detection pipelines faces scalability challenges, making it unsuitable for…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
