SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang

TL;DR
SetCon introduces a set-level concept prediction framework for open-ended referring segmentation, improving multi-target grounding accuracy and enabling transfer to video tasks.
Contribution
It reformulates referring segmentation as set-level concept prediction using LVLMs and hierarchical semantic decomposition, with a new annotation pipeline for supervision.
Findings
Achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE).
Transfers the concept interface to video, setting a new state of the art on seven benchmarks (+10.9 J&F on MeViS).
Supports open-ended, multi-target, and cross-category referring segmentation tasks.
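The gIoU numbers above are not defined on this page. As a rough illustration only, the sketch below assumes the convention commonly used with gRefCOCO: gIoU is the mean of per-sample IoU, and a no-target sample scores 1 when the prediction is also empty and 0 otherwise. The function names `mask_iou` and `giou` are hypothetical, and this is not the authors' evaluation code.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (assumed representation for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """IoU of two binary masks of the same shape.

    Assumed convention: if both masks are empty (a correctly predicted
    no-target sample), the IoU is taken to be 1.0.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    if union == 0:
        return 1.0  # both empty: treated as a correct no-target prediction
    return inter / union


def giou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean per-sample IoU over a dataset (the assumed gIoU definition)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        raise ValueError("at least one sample is required")
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Two toy samples: one partial overlap, one correctly predicted no-target case.
example_preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
example_gts = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
example_score = giou(example_preds, example_gts)
```

Under this convention, a non-empty prediction on a no-target sample has zero intersection and a non-zero union, so it contributes 0, which penalizes hallucinated targets.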
Abstract
Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the…
