Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Haichao Jiang; Tianming Liang; Wei-Shi Zheng; Jian-Fang Hu

arXiv:2602.03595·cs.CV·February 9, 2026

Open Access

TL;DR

Refer-Agent is a collaborative multi-agent system with reasoning and reflection mechanisms for referring video object segmentation. It outperforms existing methods and allows new multi-modal large language models (MLLMs) to be integrated without fine-tuning.

Contribution

The paper proposes a multi-agent system for RVOS that combines reasoning-reflection processes with adaptive strategies, reducing data dependence and improving performance.

Findings

01. Outperforms state-of-the-art methods on five benchmarks.

02. Enables fast integration of new MLLMs without fine-tuning.

03. Demonstrates robustness and flexibility in zero-shot settings.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind that of SFT-based methods because of their straightforward workflow designs. To address these limitations, we propose Refer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection