R2SM: Referring and Reasoning for Selective Masks

Yu-Lin Shih; Wei-En Tai; Cheng Sun; Yu-Chiang Frank Wang; Hwann-Tzong Chen

arXiv:2506.01795 · cs.CV · June 3, 2025

TL;DR

The paper introduces R2SM, a new task and dataset for vision-language models to perform selective segmentation based on user intent, distinguishing between visible and complete object masks using natural language prompts.

Contribution

It proposes the R2SM task and dataset, enabling models to interpret and generate modal or amodal masks according to user-specified intent in natural language.

Findings

1. Models can be trained to distinguish modal and amodal masks based on prompts.

2. The R2SM dataset supports fine-tuning and evaluating intent-aware segmentation.

3. Benchmark results show improved understanding of user intent in segmentation tasks.

Abstract

We introduce a new task, Referring and Reasoning for Selective Masks (R2SM), which extends text-guided segmentation by incorporating mask-type selection driven by user intent. This task challenges vision-language models to determine whether to generate a modal (visible) or amodal (complete) segmentation mask based solely on natural language prompts. To support the R2SM task, we present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA. The R2SM dataset consists of both modal and amodal text queries, each paired with the corresponding ground-truth mask, enabling model finetuning and evaluation for the ability to segment images as per user intent. Specifically, the task requires the model to interpret whether a given prompt refers to only the visible part of an object or to its complete shape, including occluded regions, and then produce the appropriate…
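
To make the modal/amodal distinction concrete, here is a minimal toy sketch in Python. It is not the authors' code or the R2SM dataset format: the `Sample` record, the example queries, the rectangular "cup" and "book" shapes, and the use of intersection-over-union (IoU) as the score are all assumptions made for illustration. Masks are represented as sets of pixel coordinates so the example needs only the standard library.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pixel = Tuple[int, int]  # (row, col)


def rect(top: int, left: int, bottom: int, right: int) -> FrozenSet[Pixel]:
    """Pixels of an axis-aligned rectangle; bottom and right are exclusive."""
    return frozenset((r, c) for r in range(top, bottom) for c in range(left, right))


@dataclass(frozen=True)
class Sample:
    """Hypothetical record layout, NOT the actual R2SM dataset schema."""
    query: str
    mask_type: str            # "modal" (visible only) or "amodal" (complete shape)
    mask: FrozenSet[Pixel]    # ground-truth mask for this query


def iou(pred: FrozenSet[Pixel], gt: FrozenSet[Pixel]) -> float:
    """Intersection-over-union of two masks (assumed metric, for illustration)."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


# Toy scene: a 4x4 "cup" whose right half is hidden behind a "book".
amodal_cup = rect(0, 0, 4, 4)      # complete shape, 16 pixels
book = rect(0, 2, 4, 6)            # the occluder
modal_cup = amodal_cup - book      # visible part only, 8 pixels

# One modal and one amodal query for the same object (invented wording).
samples = [
    Sample("the visible part of the cup", "modal", modal_cup),
    Sample("the whole cup, including the part behind the book", "amodal", amodal_cup),
]

# A model that ignores intent and always returns the visible region
# is correct for the modal query but not for the amodal one.
for s in samples:
    print(s.mask_type, iou(modal_cup, s.mask))
```

In this toy scene the intent-blind prediction scores an IoU of 1.0 on the modal query and 0.5 on the amodal query, which is the gap a model has to close by reading mask-type intent from the prompt.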

Taxonomy

Topics: Semantic Web and Ontologies