R2SM: Referring and Reasoning for Selective Masks
Yu-Lin Shih, Wei-En Tai, Cheng Sun, Yu-Chiang Frank Wang, Hwann-Tzong Chen

TL;DR
The paper introduces R2SM, a new task and dataset in which vision-language models perform selective segmentation based on user intent, choosing between a visible (modal) and a complete (amodal) object mask according to a natural language prompt.
Contribution
It proposes the R2SM task and dataset, which require a model to infer from a natural language prompt whether the user wants a modal or an amodal mask and then to generate that mask.
Findings
Models can be trained to distinguish modal and amodal masks based on prompts.
The R2SM dataset supports fine-tuning and evaluating intent-aware segmentation.
Benchmark results show improved understanding of user intent in segmentation tasks.
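As an illustration of what intent-aware evaluation might look like, the sketch below pairs each text query with a mask type and scores a prediction against the ground-truth mask that matches that type. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the `Sample` record, its field names, the example queries, and the use of plain IoU are all hypothetical, and masks are stored as sets of (row, col) pixels only to keep the example dependency-free.

```python
from dataclasses import dataclass

Pixel = tuple[int, int]  # (row, col)


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: one text query plus both masks of the object."""
    query: str
    mask_type: str                 # "modal" (visible part) or "amodal" (complete shape)
    modal_mask: frozenset[Pixel]
    amodal_mask: frozenset[Pixel]


def target_mask(sample: Sample) -> frozenset[Pixel]:
    """Pick the ground-truth mask that matches the query's intent."""
    if sample.mask_type == "modal":
        return sample.modal_mask
    if sample.mask_type == "amodal":
        return sample.amodal_mask
    raise ValueError(f"unknown mask type: {sample.mask_type!r}")


def iou(pred: frozenset[Pixel], gt: frozenset[Pixel]) -> float:
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = pred | gt
    if not union:
        return 1.0
    return len(pred & gt) / len(union)


# Toy object: 2 rows x 3 columns when complete; an occluder hides column 2.
amodal = frozenset((r, c) for r in (1, 2) for c in (0, 1, 2))   # 6 pixels
modal = frozenset((r, c) for (r, c) in amodal if c != 2)        # 4 pixels

# The query strings are made-up examples, not taken from the dataset.
modal_sample = Sample("the part of the cup you can see", "modal", modal, amodal)
amodal_sample = Sample("the whole cup, including what is hidden", "amodal", modal, amodal)

# A model that always returns the visible region is only right for modal queries.
prediction = modal
modal_score = iou(prediction, target_mask(modal_sample))
amodal_score = iou(prediction, target_mask(amodal_sample))
print(modal_score, round(amodal_score, 3))
```

In this toy case the same prediction is a perfect match for the modal query but covers only four of the six pixels required by the amodal query, which is the kind of gap an intent-aware benchmark is meant to expose.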
Abstract
We introduce a new task, Referring and Reasoning for Selective Masks (R2SM), which extends text-guided segmentation by incorporating mask-type selection driven by user intent. This task challenges vision-language models to determine whether to generate a modal (visible) or amodal (complete) segmentation mask based solely on natural language prompts. To support the R2SM task, we present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA. The R2SM dataset consists of both modal and amodal text queries, each paired with the corresponding ground-truth mask, enabling model finetuning and evaluation for the ability to segment images as per user intent. Specifically, the task requires the model to interpret whether a given prompt refers to only the visible part of an object or to its complete shape, including occluded regions, and then produce the appropriate…
Taxonomy
Topics: Semantic Web and Ontologies
