Refer to Any Segmentation Mask Group With Vision-Language Prompts

Shengcao Cao; Zijun Wei; Jason Kuen; Kangning Liu; Lingzhi Zhang; Jiuxiang Gu; HyunJoon Jung; Liang-Yan Gui; Yu-Xiong Wang

arXiv:2506.05342 · cs.CV · October 20, 2025

TL;DR

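This paper introduces omnimodal referring expression segmentation (ORES), a task in which a model produces a group of masks from prompts given as text alone or as text plus reference visual entities. It also proposes the RAS framework, which is reported to outperform existing models on ORES as well as on the existing RES and GRES tasks.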

Contribution

The paper defines the ORES task and develops the RAS ("Refer to Any Segmentation Mask Group") framework, which augments segmentation models with multimodal interaction and comprehension through a mask-centric large multimodal model.

Findings

01. RAS outperforms existing models on ORES, RES, and GRES tasks.

02. New datasets MaskGroups-2M and MaskGroups-HQ support training and benchmarking.

03. The approach enhances segmentation accuracy with complex multimodal prompts.

Abstract

Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create…
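The input and output of the ORES task described in the abstract can be illustrated with a minimal Python sketch. This is not the RAS implementation: the names `OmnimodalPrompt`, `select_mask_group`, and `toy_score`, the list-of-lists mask representation, and the threshold rule are all hypothetical, and the scorer is a toy stand-in for the mask-centric large multimodal model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical mask representation: a binary grid (rows x cols).
Mask = List[List[bool]]


@dataclass
class OmnimodalPrompt:
    """Text only, or text plus reference visual entities given as masks."""
    text: str
    reference_masks: List[Mask] = field(default_factory=list)


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels of a binary mask."""
    return sum(cell for row in mask for cell in row)


def toy_score(mask: Mask, prompt: OmnimodalPrompt) -> float:
    """Toy stand-in for a learned scorer (assumption, not the paper's model).

    With reference masks: 1.0 if the candidate has the same area as any
    reference, else 0.0. With text only: 1.0 for any non-empty mask.
    """
    if not prompt.reference_masks:
        return 1.0 if mask_area(mask) > 0 else 0.0
    ref_areas = {mask_area(ref) for ref in prompt.reference_masks}
    return 1.0 if mask_area(mask) in ref_areas else 0.0


def select_mask_group(
    candidates: Sequence[Mask],
    prompt: OmnimodalPrompt,
    score_fn: Callable[[Mask, OmnimodalPrompt], float],
    threshold: float = 0.5,
) -> List[int]:
    """Return the indices of candidate masks whose score meets the threshold."""
    return [
        i for i, mask in enumerate(candidates)
        if score_fn(mask, prompt) >= threshold
    ]


# Candidate masks, as a segmentation model might propose them.
large = [[True, True], [True, True]]
small_a = [[True, False], [False, False]]
small_b = [[False, False], [False, True]]
candidates = [large, small_a, small_b]

# Text plus one reference visual entity.
prompt = OmnimodalPrompt("objects the same size as the reference", [small_a])
group = select_mask_group(candidates, prompt, toy_score)

# Text-only prompt.
text_only = OmnimodalPrompt("all objects")
all_group = select_mask_group(candidates, text_only, toy_score)
```

The point of the sketch is the interface: the output is a group of masks (possibly several, possibly none) rather than a single mask, and the reference entities are optional. A real system would replace `toy_score` with a model that jointly reads the text and the reference entities.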

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and Dialogue Systems