ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections
Ziling Huang, Yidan Zhang, Shin'ichi Satoh

TL;DR
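ReSeDis introduces a unified task that combines large-scale image retrieval with fine-grained object localization from natural language descriptions. It targets two gaps in existing methods: visual grounding assumes the described object is present in every test image, and text-to-image retrieval stops at whole-image matches without localizing the object.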
Contribution
The paper presents the first task and benchmark for joint corpus-level retrieval and pixel-level grounding, along with a zero-shot baseline built on frozen vision-language models.
Findings
Benchmark dataset with unique description-to-object mappings
Proposed metric combining retrieval recall and localization precision
Baseline results indicating significant room for improvement
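This summary does not give the formula for the proposed metric, so the following is only a minimal sketch of how retrieval recall and localization precision could be computed side by side for one query. The function names (`iou`, `joint_scores`), the dictionary layout, and the 0.5 IoU threshold are all assumptions made for illustration, not the paper's definitions.

```python
from typing import Dict, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_scores(
    gt: Dict[str, Box],
    pred: Dict[str, Box],
    iou_thr: float = 0.5,  # assumed threshold, not taken from the paper
) -> Tuple[float, float]:
    """Return (retrieval_recall, localization_precision) for one query.

    gt maps each image that truly contains the object to its box.
    pred maps each image the model returned to its predicted box.
    """
    # Images that were returned and really contain the object.
    hits = [img for img in pred if img in gt]
    retrieval_recall = len(hits) / len(gt) if gt else 0.0

    # A returned image counts as correct only if its box is tight enough.
    well_localized = [img for img in hits if iou(gt[img], pred[img]) >= iou_thr]
    localization_precision = len(well_localized) / len(pred) if pred else 0.0
    return retrieval_recall, localization_precision


gt = {"a": (0, 0, 10, 10), "b": (0, 0, 10, 10)}
pred = {"a": (0, 0, 10, 10), "c": (0, 0, 5, 5)}
print(joint_scores(gt, pred))
```

In this toy query the model finds one of the two relevant images and returns one false alarm, so both scores are 0.5. A false alarm on an image that lacks the object lowers localization precision, which is the failure mode the abstract attributes to grounding-only models.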
Abstract
Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
