RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

Xingqi He; Yujie Zhang; Shuyong Gao; Wenjie Li; Lingyi Hong; Mingxi Chen; Kaixun Jiang; Jiyuan Fu; Wenqiang Zhang

arXiv:2512.24023 · cs.CV · January 1, 2026

Open Access

TL;DR

RSAgent is an agentic multimodal model that iteratively reasons over and refines text-guided segmentation masks through multi-turn tool interactions, improving both zero-shot and in-domain performance (66.5% gIoU on the ReasonSeg zero-shot test; 81.5% cIoU on RefCOCOg).

Contribution

The paper introduces RSAgent, a multi-turn reasoning framework with tool invocation for text-guided segmentation, trained with supervised fine-tuning and reinforcement learning.

Findings

1. Achieves 66.5% gIoU on the ReasonSeg zero-shot test
2. Reaches 81.5% cIoU on RefCOCOg
3. Outperforms previous methods by 9% on ReasonSeg
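
The findings above report two different metrics, gIoU and cIoU. This page does not define them; in reasoning-segmentation benchmarks, gIoU usually means the mean of per-image IoUs and cIoU means cumulative intersection over cumulative union across the dataset. The sketch below illustrates that common definition on toy masks. The function names, the mask representation (nested lists of 0/1), and the handling of empty masks are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumption: a binary mask is a sequence of rows, each a sequence of 0/1 ints.
Mask = Sequence[Sequence[int]]


def inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels for one prediction/ground-truth pair."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean of per-image IoU: every image counts equally."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = inter_union(pred, gt)
        # Assumption: an empty prediction against an empty target scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Cumulative IoU: pool pixels over the dataset, so large objects dominate."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Toy data: one half-correct small object and one perfectly segmented larger one.
preds = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]

g = giou(preds, gts)  # (0.5 + 1.0) / 2
c = ciou(preds, gts)  # (1 + 4) / (2 + 4)
```

Because cIoU pools pixels, it weights large objects more heavily than gIoU does, which is one reason scores on the two metrics are not directly comparable.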

Abstract

Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised…
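
The loop the abstract describes (query a segmentation tool, observe the result, revise the spatial hypothesis using the history) can be sketched as a small control loop. This is a toy illustration only: `run_episode`, `Turn`, and the three `toy_*` stand-ins are hypothetical names, the "scene" is two hand-written pixel sets, and a simple size check stands in for the model's own verification step. None of it reflects RSAgent's actual toolbox, prompts, or training.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Point = Tuple[int, int]
Mask = FrozenSet[Point]  # assumption: a mask is a set of (row, col) pixels


@dataclass
class Turn:
    point: Point    # the point prompt sent to the tool on this turn
    mask: Mask      # the mask the tool returned (the "visual feedback")
    accepted: bool  # whether the agent judged the mask to match the query


def run_episode(
    propose: Callable[[List[Turn]], Point],
    segment: Callable[[Point], Mask],
    accept: Callable[[Mask], bool],
    max_turns: int = 4,
) -> List[Turn]:
    """Alternate between acting (tool call) and checking until accepted."""
    history: List[Turn] = []
    for _ in range(max_turns):
        point = propose(history)  # revise the hypothesis from past turns
        mask = segment(point)     # invoke the segmentation tool
        ok = accept(mask)         # verify the observation
        history.append(Turn(point, mask, ok))
        if ok:
            break
    return history


# Toy scene with two objects; the query is "the larger object".
SMALL: Mask = frozenset({(0, 0)})
LARGE: Mask = frozenset({(2, 2), (2, 3), (3, 2), (3, 3)})
CANDIDATES: List[Point] = [(0, 0), (2, 2)]


def toy_segment(point: Point) -> Mask:
    """Return the object containing the point, or an empty mask."""
    for obj in (SMALL, LARGE):
        if point in obj:
            return obj
    return frozenset()


def toy_propose(history: List[Turn]) -> Point:
    """Pick the first candidate point that has not been tried yet."""
    tried = {turn.point for turn in history}
    return next((p for p in CANDIDATES if p not in tried), CANDIDATES[-1])


def toy_accept(mask: Mask) -> bool:
    """Stand-in verifier: the larger object covers at least four pixels."""
    return len(mask) >= 4


history = run_episode(toy_propose, toy_segment, toy_accept)
```

In this toy run the first prompt lands on the wrong object, the check rejects it, and the second turn re-localizes to the correct one. A one-shot pipeline would have stopped after the first mask.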

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling