The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

Xusheng He; Canyang Wu; Jinrong Zhang; Weili Guan; Jianlong Wu; Liqiang Nie

arXiv:2604.00404 · cs.CV · April 2, 2026

TL;DR

This report presents a fully training-free pipeline that combines multimodal large language models with SAM3 for referring video object segmentation. It took first place in the 5th PVUW MeViS-Text Challenge with no task-specific fine-tuning.

Contribution

The authors propose a training-free, three-stage approach that combines strong multimodal large language models (Gemini-3.1 Pro and Qwen3.5-Plus) with SAM3 for referring video object segmentation, and that ranked first in the challenge.

Findings

01. Achieved first place on PVUW 2026 MeViS-Text test set

02. Final score of 0.909064 and J&F score of 0.7897

03. No task-specific fine-tuning required
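
For readers unfamiliar with the J&F metric reported above: in the usual video object segmentation convention, J is the region similarity (intersection over union of predicted and ground-truth masks) and J&F is the mean of J and the contour accuracy F. The sketch below illustrates that convention only. It assumes binary masks given as nested lists, takes F as a precomputed input, and treats two empty masks as a perfect match; the function names are hypothetical, and how the challenge's separate "final score" is computed is not stated here.

```python
from typing import Sequence


def region_similarity(pred: Sequence[Sequence[int]],
                      gt: Sequence[Sequence[int]]) -> float:
    """J: intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0
```

In practice these values are computed per frame and averaged over frames and videos; the evaluation server's exact aggregation is not described on this page.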

Abstract

This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically…
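
The three stages described in the abstract can be sketched as a simple orchestration function. This is a minimal illustration under stated assumptions, not the authors' implementation: the stage callables (`decompose`, `seed`, `propagate`, `refine`), the `GroundingTarget` type, and all parameter names are hypothetical placeholders standing in for Gemini-3.1 Pro, SAM3-agent, the SAM3 tracker, and the Qwen3.5-Plus refinement stage. No real model API is called.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class GroundingTarget:
    """One instance-level target produced by stage 1 (hypothetical type)."""
    frame_index: int   # frame where the target is most clearly visible
    description: str   # discriminative description of the target


def segment_expression(
    frames: Sequence[Any],
    expression: str,
    decompose: Callable[[Sequence[Any], str], List[GroundingTarget]],
    seed: Callable[[Any, str], Any],
    propagate: Callable[[Sequence[Any], int, Any], List[Any]],
    refine: Callable[[Sequence[Any], str, List[List[Any]]], List[List[Any]]],
) -> List[List[Any]]:
    """Run the three stages and return one mask track per target."""
    # Stage 1: an MLLM splits the expression into instance-level targets,
    # each with a chosen frame and a description.
    targets = decompose(frames, expression)

    tracks: List[List[Any]] = []
    for target in targets:
        # Stage 2a: produce a seed mask on the selected frame.
        seed_mask = seed(frames[target.frame_index], target.description)
        # Stage 2b: propagate the seed mask through the whole video.
        tracks.append(propagate(frames, target.frame_index, seed_mask))

    # Stage 3: a second MLLM pass checks and corrects the tracks.
    return refine(frames, expression, tracks)
```

Passing the stages in as callables keeps the sketch independent of any particular model client, so each stage can be stubbed or swapped without touching the control flow.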

Code & Models

Repositories

Moujuruo/MeViSv2_Track_Solution_2026 (GitHub)