RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations
Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, Wenqiang Zhang

TL;DR
RSAgent is an agentic multimodal model that reasons over and iteratively refines text-guided segmentation masks through multi-turn tool interactions, improving both zero-shot and in-domain performance.
Contribution
It introduces RSAgent, a multi-turn reasoning framework with tool invocation capabilities for improved text-guided segmentation, trained via supervised fine-tuning and reinforcement learning.
Findings
Achieves 66.5% gIoU on ReasonSeg zero-shot test
Reaches 81.5% cIoU on RefCOCOg
Outperforms previous methods by 9% on ReasonSeg
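The two metrics above are easy to confuse. As commonly defined for reasoning and referring segmentation benchmarks, gIoU is the mean of per-image IoU values, while cIoU divides the intersection summed over all images by the union summed over all images, so large objects weigh more in cIoU. The following is a minimal sketch of that distinction, not the authors' evaluation code; the function names, the nested-list mask format, and the convention of scoring an empty prediction against an empty ground truth as 1.0 are assumptions made for illustration.

```python
def intersection_and_union(pred, gt):
    """Count intersecting and united pixels of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou(pairs):
    """Mean of per-image IoU over (pred, gt) pairs."""
    ious = []
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def ciou(pairs):
    """Cumulative intersection divided by cumulative union over all pairs."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# A small object segmented poorly (IoU 0.5) and a large one segmented perfectly (IoU 1.0).
small = ([[1, 1], [0, 0]], [[1, 0], [0, 0]])
large = ([[1] * 4 for _ in range(4)], [[1] * 4 for _ in range(4)])
pairs = [small, large]

print(giou(pairs))  # 0.75
print(ciou(pairs))  # 17/18, about 0.944
```

Because the large object dominates the pixel counts, cIoU comes out higher than gIoU on this toy pair, which is why the two numbers are not directly comparable across benchmarks.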
Abstract
Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised…
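The query, observe, revise cycle described in the abstract can be pictured with a toy loop. This is only an illustrative sketch under stated assumptions, not RSAgent's implementation: `refine`, `propose`, `segment`, and `judge` are hypothetical names, the "image" is a dictionary of pixel sets, and the `judge` stand-in compares against a known target, which a real agent would replace with the model's own assessment of the visual feedback.

```python
def refine(propose, segment, judge, max_turns=4, accept=0.9):
    """Run a multi-turn loop: propose a prompt, call the tool, score the result.

    All three callables are hypothetical stand-ins supplied by the caller.
    Returns the best (prompt, mask, score) turn and the full history.
    """
    history = []
    for _ in range(max_turns):
        prompt = propose(history)      # revise the hypothesis from past turns
        mask = segment(prompt)         # tool invocation
        score = judge(mask)            # feedback on the returned mask
        history.append((prompt, mask, score))
        if score >= accept:            # stop once the mask looks right
            break
    best = max(history, key=lambda turn: turn[2])
    return best, history


# Toy scene: two objects, each a set of (row, col) pixels.
objects = {"cat": {(0, 0), (0, 1)}, "dog": {(2, 2), (2, 3)}}
candidates = [(0, 0), (2, 2)]          # point prompts the toy policy may try
target = objects["dog"]


def propose(history):
    """Pick the first candidate point not tried yet; repeat the last one if exhausted."""
    tried = {prompt for prompt, _, _ in history}
    for point in candidates:
        if point not in tried:
            return point
    return candidates[-1]


def segment(point):
    """Toy segmentor: return the object containing the point, or an empty mask."""
    for pixels in objects.values():
        if point in pixels:
            return set(pixels)
    return set()


def judge(mask):
    """Toy feedback: IoU against the known target (an assumption for this demo only)."""
    union = mask | target
    return len(mask & target) / len(union) if union else 0.0


best, history = refine(propose, segment, judge)
print(len(history), best[0])
```

In this toy run the first point lands on the wrong object, the low score triggers a second turn, and the loop stops once the revised prompt yields a mask that passes the acceptance threshold.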
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
