GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation

Sandesh Hegde; Jaison Saji Chacko; Debarshi Banerjee; Uma Mahesh

arXiv:2602.09701 · cs.CV · February 11, 2026

TL;DR

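This paper introduces GenSeg-R1, a framework in which a vision-language model reasons about a scene and emits spatial prompts (a bounding box plus two interior keypoints per referred instance) that a frozen SAM 2 converts into masks. Trained with GRPO and without supervised reasoning annotations, it outperforms the Qwen3-VL Instruct baselines and Seg-Zero-7B on RefCOCOg validation.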

Contribution

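The paper presents a decoupled reason-then-segment pipeline in which Qwen3-VL models (4B and 8B parameters) are fine-tuned with Group Relative Policy Optimization (GRPO) while the segmenter (SAM 2) stays frozen. It also introduces a variant trained with a mask quality reward.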
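The interface between the two decoupled stages can be pictured as a small data structure. The sketch below is illustrative only: the abstract states that the model emits a bounding box plus two interior keypoints per referred instance, but the class name `SpatialPrompt`, the field names, the pixel-coordinate convention, and the helper `to_segmenter_inputs` are all hypothetical and are not taken from the paper or from the SAM 2 API. The check that keypoints fall inside the box is likewise an assumption about what "interior" implies.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class SpatialPrompt:
    """Hypothetical container for one referred instance (not the paper's format)."""

    box: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) in pixels
    keypoints: Tuple[Point, Point]          # two points expected to lie on the object

    def __post_init__(self) -> None:
        x0, y0, x1, y1 = self.box
        if not (x0 < x1 and y0 < y1):
            raise ValueError("box must satisfy x_min < x_max and y_min < y_max")
        if len(self.keypoints) != 2:
            raise ValueError("exactly two keypoints are expected")
        # Assumption: an interior keypoint must at least fall inside the box.
        for x, y in self.keypoints:
            if not (x0 <= x <= x1 and y0 <= y <= y1):
                raise ValueError(f"keypoint {(x, y)} lies outside the box")


def to_segmenter_inputs(prompts: List[SpatialPrompt]) -> List[Dict[str, list]]:
    """Flatten prompts into plain lists, one dict per referred instance."""
    return [
        {
            "box": list(p.box),
            "points": [list(k) for k in p.keypoints],
            "point_labels": [1, 1],  # assumed convention: 1 marks a foreground point
        }
        for p in prompts
    ]
```

Validating the prompt before it reaches the segmenter keeps malformed model output from silently producing an empty or misplaced mask; a real system would decide separately how to score or retry such outputs.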

Findings

01

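GenSeg-R1-8B achieves 0.7127 cIoU and 0.7382 mIoU on RefCOCOg validation, outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively).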

02

GenSeg-R1-G attains 76.69% target mIoU on GRefCOCO.

03

Surpasses Seg-Zero-7B by +3.3 cIoU on RefCOCOg validation under identical evaluation.
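The findings report both cIoU and mIoU, which can diverge noticeably on the same predictions. The sketch below assumes the definitions commonly used in referring segmentation: cIoU divides the total intersection by the total union accumulated over the whole evaluation set, while mIoU averages the per-sample IoU. The paper's exact evaluation code is not shown in this listing, so the function names, the flattened-mask representation, and the handling of empty unions are assumptions.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, nonzero = foreground


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels for one sample."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative IoU: summed intersections over summed unions."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def miou(preds: List[Mask], gts: List[Mask]) -> float:
    """Mean IoU: average of per-sample IoU values."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
print(round(ciou(preds, gts), 4), round(miou(preds, gts), 4))  # 0.3333 0.375
```

Because cIoU pools pixels across samples, large objects dominate it, whereas mIoU weights every sample equally; this is one reason the two numbers are reported side by side.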

Abstract

We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling