Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Chao Hao; Jun Xu; Ji Du; Shuo Ye; Ziyue Qiao; Xiaodong Cun; Guangcong Wang; Xubin Zheng; Zitong Yu

arXiv:2605.12953·cs.CV·May 14, 2026

TL;DR

Seg-Agent introduces a training-free, multimodal reasoning framework for language-guided segmentation that interacts visually with images, achieving state-of-the-art performance without model training.

Contribution

Seg-Agent pioneers explicit multimodal chain-of-reasoning with visual feedback, eliminating the need for training while maintaining high segmentation accuracy.

Findings

01. Achieves performance comparable to training-based methods

02. Introduces the Various-LangSeg benchmark for diverse segmentation tasks

03. Demonstrates robustness across multiple scenarios

Abstract

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose…
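
The two-stage framework that the abstract attributes to existing approaches can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Seg-Agent's method and not any library's API: `propose_box` is a hypothetical stand-in for an MLLM that turns an instruction into a visual prompt, `segment_from_box` is a hypothetical stand-in for a promptable segmenter such as SAM, and the "image" is a toy grid of per-pixel labels.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[str]]   # toy "image": a grid of per-pixel labels (assumption)
Mask = List[List[bool]]   # True where the pixel belongs to the target


@dataclass(frozen=True)
class Box:
    """Inclusive box prompt: rows r0..r1, columns c0..c1."""
    r0: int
    c0: int
    r1: int
    c1: int


def propose_box(image: Image, instruction: str) -> Box:
    """Stage 1 (hypothetical MLLM stand-in): instruction -> box prompt.

    A real MLLM would ground free-form language; this stub only matches
    pixel labels that appear as words in the instruction.
    """
    words = set(instruction.lower().split())
    hits = [(r, c)
            for r, row in enumerate(image)
            for c, label in enumerate(row)
            if label in words]
    if not hits:
        raise ValueError("instruction does not match any region")
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return Box(min(rows), min(cols), max(rows), max(cols))


def segment_from_box(image: Image, box: Box) -> Mask:
    """Stage 2 (hypothetical segmenter stand-in): box prompt -> mask.

    Keeps pixels inside the box that carry the box's most common label.
    """
    inside = [image[r][c]
              for r in range(box.r0, box.r1 + 1)
              for c in range(box.c0, box.c1 + 1)]
    target = max(set(inside), key=inside.count)
    return [[box.r0 <= r <= box.r1 and box.c0 <= c <= box.c1 and label == target
             for c, label in enumerate(row)]
            for r, row in enumerate(image)]


def language_guided_segment(image: Image, instruction: str) -> Mask:
    """Run the two stages in sequence: interpret first, then segment."""
    return segment_from_box(image, propose_box(image, instruction))


toy_image = [
    ["sky",   "sky",   "sky",   "sky"],
    ["sky",   "cat",   "cat",   "sky"],
    ["grass", "cat",   "cat",   "grass"],
    ["grass", "grass", "grass", "grass"],
]
mask = language_guided_segment(toy_image, "segment the cat")
```

In this sketch stage 2 never reports back to stage 1, so a poor box prompt passes straight through to the mask.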
