APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
Deshui Miao, Yameng Gu, Chao Yang, Xin Li, Haijun Zhang, Ming-Hsuan Yang

TL;DR
This paper introduces APRVOS, a novel audio-aware Ref-VOS pipeline that converts spoken queries into text, verifies visual presence, and refines segmentation, achieving top results in the MEVIS_Audio challenge.
Contribution
It presents a staged framework combining speech transcription, visual existence verification, and agent-guided refinement for audio-conditioned video segmentation.
Findings
Achieved 1st place in the 5th PVUW MeViS-Audio Track.
Demonstrated improved segmentation accuracy over standard pipelines.
Validated the effectiveness of a staged audio-to-video segmentation approach.
Abstract
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the proposed system introduces two additional front-end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice-ASR to convert long-form spoken input into a structured textual transcript. Since audio-derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni-based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all-zero masks. Otherwise, the transcript is transformed into a…
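The control flow described in the abstract (transcribe, verify existence, then either exit early with empty masks or continue to segmentation) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `run_pipeline`, `zero_masks`, and the three callables `transcribe`, `target_exists`, and `segment` are hypothetical names standing in for VibeVoice-ASR, the Omni-based judgment module, and the downstream segmentation stage. Frames and masks are assumed to be plain nested lists of shape H x W. Because the abstract is truncated, whatever happens to the transcript after it passes verification is not modeled here and is left entirely to the `segment` callable.

```python
from typing import Any, Callable, List, Sequence

Frame = Sequence[Sequence[Any]]   # one H x W frame (assumed layout)
Mask = List[List[int]]            # one H x W binary mask


def zero_masks(num_frames: int, height: int, width: int) -> List[Mask]:
    """Build one all-zero H x W mask per frame."""
    return [[[0] * width for _ in range(height)] for _ in range(num_frames)]


def run_pipeline(
    audio: Any,
    frames: Sequence[Frame],
    transcribe: Callable[[Any], str],                       # hypothetical ASR wrapper
    target_exists: Callable[[str, Sequence[Frame]], bool],  # hypothetical judgment wrapper
    segment: Callable[[str, Sequence[Frame]], List[Mask]],  # hypothetical segmentation wrapper
) -> List[Mask]:
    """Staged audio-conditioned segmentation with an early exit."""
    if not frames:
        return []

    # Stage 1: convert the spoken query into text.
    transcript = transcribe(audio)

    # Stage 2: check whether the described target is visible in the video.
    if not target_exists(transcript, frames):
        # Early termination: emit all-zero masks, skip segmentation entirely.
        height, width = len(frames[0]), len(frames[0][0])
        return zero_masks(len(frames), height, width)

    # Stage 3: hand the verified transcript to the segmentation stage.
    return segment(transcript, frames)
```

Passing the three stages in as callables keeps the sketch independent of any particular model API; in a real system each would wrap the corresponding model call.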
