SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Zhixiong Zhang; Yizhuo Li; Shuangrui Ding; Yuhang Zang; Shengyuan Ding; Long Xing; Yibin Wang; Qiaosheng Zhang; Jiaqi Wang

arXiv:2605.20110·cs.CV·May 20, 2026

TL;DR

SetCon introduces a set-level concept prediction framework for open-ended referring segmentation, improving multi-target grounding accuracy and enabling transfer to video tasks.

Contribution

SetCon reformulates referring segmentation as set-level concept prediction, using large vision-language models (LVLMs) and hierarchical semantic decomposition, and adds a new annotation pipeline to supply supervision.

Findings

01. Achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE).

02. Transfers the concept interface to video, setting a new state of the art on seven benchmarks (+10.9 J&F on MeViS).

03. Supports open-ended, multi-target, and cross-category referring segmentation tasks.
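For readers unfamiliar with the gIoU figures quoted above, the sketch below shows one common way the metric is computed on generalized referring benchmarks: the mean of per-image IoU, where a sample with no target counts as 1 if the prediction is also empty and 0 otherwise. This is an illustrative assumption about the convention, not the paper's evaluation code; the names `mask_iou` and `giou` are hypothetical.

```python
from typing import Sequence

# A 2D binary mask: 1 marks a foreground pixel, 0 marks background.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """IoU of two same-sized binary masks.

    Assumed convention: if both masks are empty (a correctly handled
    no-target sample), the IoU is defined as 1.0.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    if union == 0:
        return 1.0
    return inter / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean per-image IoU over a dataset (the assumed gIoU convention)."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal in length")
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy samples: one partial overlap, one no-target sample predicted as empty.
example_preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
example_targets = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
example_score = giou(example_preds, example_targets)
```

Because each image contributes equally regardless of object size, this average differs from a cumulative IoU that pools intersections and unions over the whole dataset.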

Abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the…

Code & Models

Models

rookiexiong/SetCon-8B
model · 83 downloads · 4 likes

Datasets

rookiexiong/setcon_training_datasets
dataset · 162 downloads
