Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu

TL;DR
Seg-Agent is a training-free framework for language-guided segmentation that reasons over the image with direct visual feedback at test time, reaching performance comparable to training-based methods.
Contribution
It pioneers explicit multimodal chain-of-reasoning with visual feedback, eliminating the need for training while maintaining high segmentation accuracy.
Findings
Achieves comparable performance to training-based methods
Introduces the Various-LangSeg benchmark for diverse segmentation tasks
Demonstrates robustness across multiple scenarios
Abstract
Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose…
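The two-stage framework that the abstract attributes to existing approaches can be sketched as follows. This is a minimal illustration only, not Seg-Agent's method: `stub_mllm_propose_box` and `stub_segment_from_box` are hypothetical stand-ins for an MLLM and a promptable segmenter such as SAM, and the box-shaped visual prompt is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive
Mask = List[List[bool]]          # mask[y][x] is True inside the target


@dataclass
class Image:
    """Placeholder image that only carries its size (assumption for the sketch)."""
    width: int
    height: int


def stub_mllm_propose_box(image: Image, instruction: str) -> Box:
    """Hypothetical stage 1: an MLLM would ground the instruction to a visual prompt.

    This stub ignores the instruction and returns the central half of the image.
    """
    return (image.width // 4, image.height // 4,
            3 * image.width // 4, 3 * image.height // 4)


def stub_segment_from_box(image: Image, box: Box) -> Mask:
    """Hypothetical stage 2: a promptable segmenter would refine the box into a mask.

    This stub simply fills the box.
    """
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(image.width)]
            for y in range(image.height)]


def two_stage_segment(
    image: Image,
    instruction: str,
    propose: Callable[[Image, str], Box] = stub_mllm_propose_box,
    segment: Callable[[Image, Box], Mask] = stub_segment_from_box,
) -> Mask:
    """Run stage 1 (instruction -> visual prompt), then stage 2 (prompt -> mask)."""
    box = propose(image, instruction)
    return segment(image, box)


mask = two_stage_segment(Image(width=8, height=8), "the cup on the table")
```

Note that the mask quality in such a pipeline is bounded by the prompt produced in stage 1, which is why the abstract points to the MLLM's limited spatial grounding as the bottleneck.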
