Referring Expression Instance Retrieval and A Strong End-to-End Baseline

Xiangzhao Hao; Kuan Zhu; Hongyu Guo; Haiyun Guo; Ning Jiang; Quan Lu; Ming Tang; Jinqiao Wang

arXiv:2506.18246 · cs.CV · August 22, 2025

TL;DR

This paper introduces REIR, a new task combining image retrieval and object localization from fine-grained natural language descriptions, along with a benchmark and a strong baseline model.

Contribution

The paper defines the REIR task, builds the REIRCOCO benchmark, and proposes CLARE, an end-to-end model for referring expression instance retrieval.

Findings

01 REIRCOCO provides a large-scale benchmark for REIR.

02 CLARE achieves effective retrieval and localization in experiments.

03 The approach outperforms existing methods on the benchmark.

Abstract

Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query with an instance-level description across a large gallery and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both…
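To make the task's input and output concrete, here is a minimal sketch of the REIR contract as the abstract describes it: an instance-level query goes in, and ranked (image, instance location) pairs come out. This is not the CLARE model, whose architecture is not described in the text above. The `Instance` record, the `retrieve_instances` function, the `(x, y, w, h)` box layout, and the hand-written three-dimensional embeddings are all hypothetical, chosen only for illustration. A real system would get the embeddings from learned text and vision encoders.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Instance:
    """One candidate object in the gallery (hypothetical record layout)."""
    image_id: str
    box: tuple        # assumed (x, y, width, height) in pixels
    embedding: tuple  # assumed precomputed instance feature


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_instances(query_embedding, gallery, top_k=1):
    """Rank every instance in the gallery against the query and return
    the top_k (image_id, box) pairs, best match first."""
    ranked = sorted(
        gallery,
        key=lambda inst: cosine(query_embedding, inst.embedding),
        reverse=True,
    )
    return [(inst.image_id, inst.box) for inst in ranked[:top_k]]


# Toy gallery: two instances in one image, one in another.
gallery = [
    Instance("img_001", (10, 20, 50, 80), (1.0, 0.0, 0.0)),
    Instance("img_001", (100, 40, 30, 30), (0.0, 1.0, 0.0)),
    Instance("img_002", (5, 5, 200, 120), (0.6, 0.8, 0.0)),
]

# Stand-in for an encoded referring expression.
query = (0.0, 0.9, 0.1)

results = retrieve_instances(query, gallery, top_k=2)
for image_id, box in results:
    print(image_id, box)
```

The sketch ranks instances rather than whole images, which is the point of contrast with TIR (image-level ranking only) and REC (localization within a single given image, no gallery ranking).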

Code & Models

Datasets

haoxiangzhao/REIRCOCO (dataset, 37 downloads)
