AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
Deshui Miao, Chao Yang, Chao Tian, Guoqing Zhu, Kai Yang, Zhifan Mo, Xin Li

TL;DR
This paper presents AgentRVOS, a multi-agent pipeline for referring video object segmentation (Ref-VOS) in which Sa2VA supplies a first dense semantic hypothesis and an agent loop decides whether that hypothesis is accepted, revised, or refined.
Contribution
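It introduces a multi-agent framework that treats Sa2VA's coarse mask trajectory as a semantic prior rather than a final answer, and assigns explicit agent roles (planner, temporal partition, scout, refinement, critic) to verify and refine it.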
Findings
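Uses Sa2VA to obtain a dense, grounded first hypothesis of the referred object over the full video.
Handles presence verification up front, outputting zero masks when the referred object is absent, and searches the video temporally for informative blocks and anchor frames.
Refines masks collaboratively: reliable Sa2VA masks are converted into boxes and points for SAM3 propagation, and a critic scores the candidate trajectories.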
Abstract
This report describes a Ref-VOS pipeline centered on Sa2VA and organized with explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage. If the referred object does not exist in the video, the system directly outputs zero masks. Otherwise, Sa2VA receives the video and referring prompt and produces a coarse mask trajectory over the full video. This trajectory is treated as a semantic prior rather than a final answer. A planner agent decomposes the query, temporal partition agents identify informative blocks, scout agents search for anchor frames, and refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. A critic scores candidate trajectories, a reflection…
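Two steps in the abstract are concrete enough to sketch: the presence gate that emits zero masks, and the conversion of a coarse mask into a box and a point prompt. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The names `zero_masks`, `mask_to_box_and_point`, and `prompts_from_coarse` are hypothetical, masks are plain nested lists of 0/1, and the choice of point (the foreground pixel nearest the centroid) is an assumption; the report does not say how points are selected. No Sa2VA or SAM3 call is made here.

```python
from typing import Dict, List, Optional, Tuple

Mask = List[List[int]]            # one binary mask: rows of 0/1 values
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive
Point = Tuple[int, int]           # (x, y)


def zero_masks(num_frames: int, height: int, width: int) -> List[Mask]:
    """All-background masks, used when the referred object is judged absent."""
    return [[[0] * width for _ in range(height)] for _ in range(num_frames)]


def mask_to_box_and_point(mask: Mask) -> Optional[Tuple[Box, Point]]:
    """Turn one coarse mask into a box prompt and a single positive point.

    Returns None for an empty mask. The point is the foreground pixel
    closest to the centroid (an assumed heuristic), so it always lies
    on the object even when the mask is non-convex.
    """
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    box = (min(xs), min(ys), max(xs), max(ys))
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    point = min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return box, point


def prompts_from_coarse(
    target_present: bool, coarse_masks: List[Mask]
) -> Tuple[Optional[List[Mask]], Dict[int, Tuple[Box, Point]]]:
    """Presence gate followed by prompt extraction.

    If the target is absent, return final zero masks and no prompts.
    Otherwise return None for the final masks (refinement is still pending)
    and a dict mapping frame index to (box, point) for non-empty frames.
    """
    if not target_present:
        height = len(coarse_masks[0]) if coarse_masks else 0
        width = len(coarse_masks[0][0]) if height else 0
        return zero_masks(len(coarse_masks), height, width), {}
    prompts: Dict[int, Tuple[Box, Point]] = {}
    for idx, mask in enumerate(coarse_masks):
        result = mask_to_box_and_point(mask)
        if result is not None:
            prompts[idx] = result
    return None, prompts
```

In the pipeline described above, only masks judged reliable would be converted; this sketch skips that filter and converts every non-empty frame. The resulting prompts would then be handed to a propagation model such as SAM3.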
