Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval

Li-Cheng Shen; Jih-Kang Hsieh; Wei-Hua Li; Chu-Song Chen

arXiv:2506.22864 · cs.CV · July 1, 2025

Open Access

TL;DR

This paper introduces Mask-aware TIR (MaTIR), a unified framework combining efficient text-to-image retrieval with precise object segmentation, leveraging large language models and region-level embeddings for improved accuracy and interpretability.

Contribution

The paper proposes a two-stage framework that unifies text-to-image retrieval (TIR) and referring expression segmentation (RES), using SAM 2, Alpha-CLIP, and a multimodal large language model (MLLM) for scalable, accurate retrieval and segmentation.

Findings

01. Significant improvements in retrieval accuracy on the COCO and D$^3$ datasets.

02. Enhanced segmentation quality with efficient object localization.

03. Effective integration of large language models for reranking and grounding.

Abstract

Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). We first leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline, enabling…
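The first-stage idea in the abstract (region-level embeddings computed offline, then searched at query time) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how region scores are combined, so the sketch assumes cosine similarity and scores each image by its best-matching region. The names `retrieve`, `region_index`, and the toy vectors are hypothetical. In the described framework the embeddings would come from Alpha-CLIP applied to SAM 2 masks, and the second-stage MLLM reranking is not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_emb: Vector,
    region_index: Dict[str, List[Vector]],
    top_k: int = 2,
) -> List[Tuple[str, int, float]]:
    """Rank images by their best-matching region embedding.

    Returns (image_id, best_region_idx, score) tuples, highest score first.
    Max-over-regions scoring is an assumption of this sketch, not a detail
    taken from the paper.
    """
    ranked = []
    for image_id, regions in region_index.items():
        if not regions:
            continue  # an image with no precomputed regions cannot be scored
        scores = [cosine(query_emb, region) for region in regions]
        best = max(range(len(scores)), key=scores.__getitem__)
        ranked.append((image_id, best, scores[best]))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked[:top_k]


# Toy offline index: made-up 3-d vectors standing in for region embeddings.
toy_index = {
    "img_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "img_b": [[0.0, 0.0, 1.0]],
    "img_c": [[0.6, 0.8, 0.0]],
}
toy_query = [0.0, 1.0, 0.0]  # stand-in for an encoded text query
results = retrieve(toy_query, toy_index, top_k=2)
```

Because the index stores which region produced the top score, each retrieved image comes with a candidate mask that a later stage could verify or refine, which is one way the retrieval result can stay interpretable.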

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Segment Anything Model