Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Yuan Yao, Qiushi Yang, Humen Zhong, Jiangning Wei, Yifang Men, Shuai Bai, Miaomiao Cui, Zhibo Yang

TL;DR
Qwen3-VL-Seg is a parameter-efficient model for open-world referring segmentation. It couples the vision-language grounding of an MLLM with pixel-level mask prediction and is reported to outperform existing methods.
Contribution
The paper introduces a parameter-efficient framework that decodes MLLM-predicted bounding boxes into dense segmentation masks, together with a new dataset and benchmark for evaluation.
Findings
Strong performance on referring expression segmentation tasks.
Effective out-of-distribution generalization.
Maintains general multimodal capabilities after adaptation.
Abstract
Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale…
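To make the idea of a box acting as a structural prior more concrete, here is a minimal NumPy sketch. It is not the paper's box-guided mask decoder, whose details are not given in the excerpt above. It only illustrates one simple, assumed way a predicted box could constrain a dense mask: rasterize the box into a binary map and suppress mask logits outside it. The function names `box_to_prior` and `apply_box_prior`, the `(x1, y1, x2, y2)` pixel box format, and the random logits are all hypothetical.

```python
import numpy as np

def box_to_prior(box, height, width):
    """Rasterize an (x1, y1, x2, y2) pixel box into a binary HxW prior (assumed format)."""
    x1, y1, x2, y2 = box
    # Clamp to image bounds so out-of-range box predictions do not raise.
    x1, x2 = max(0, int(x1)), min(width, int(x2))
    y1, y2 = max(0, int(y1)), min(height, int(y2))
    prior = np.zeros((height, width), dtype=np.float32)
    if x2 > x1 and y2 > y1:
        prior[y1:y2, x1:x2] = 1.0
    return prior

def apply_box_prior(mask_logits, prior, outside_value=-1e4):
    """Hard gating (an illustrative choice): keep logits inside the box, suppress the rest."""
    return np.where(prior > 0, mask_logits, outside_value)

# Stand-in for per-pixel logits that a mask decoder would produce.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8)).astype(np.float32)

prior = box_to_prior((2, 1, 6, 5), height=8, width=8)
mask = apply_box_prior(logits, prior) > 0  # boolean mask, empty outside the box
```

Hard gating like this cannot recover object parts that fall outside an imperfect box. A learned decoder can treat the box as a softer cue, which is closer to what the abstract describes.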
