3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio
Jihwan Hong, Jaeyoung Do

TL;DR
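VIRST-Audio is a framework for audio-based referring video object segmentation. It transcribes the audio query to text with an ASR module, segments with a pretrained text-based RVOS model, and adds an existence-aware gate that suppresses masks when the referred object is absent. It placed 3rd in the MeViS-Audio track of the 5th PVUW challenge.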
Contribution
The report applies text-based reasoning to audio-driven segmentation without audio-specific training, and improves robustness with an existence-aware gating mechanism that suppresses predictions when the referred target is not in the video.
Findings
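Placed 3rd in the MeViS-Audio track of the 5th PVUW challenge
Demonstrated effective transfer from text-based reasoning to audio-driven scenarios via ASR transcription
Reduced hallucinated masks and stabilized segmentation with the existence-aware gating mechanism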
Abstract
Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on…
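The pipeline described in the abstract (transcribe the audio, segment from the resulting text, then gate the output on an existence estimate) can be illustrated with a minimal sketch. Everything named here is hypothetical: `segment_from_audio`, `gate_masks`, the `0.5` threshold, and the toy stand-in functions are assumptions for illustration, not the authors' implementation. The abstract does not specify how the existence score is computed or thresholded.

```python
from typing import Callable, List

Mask = List[List[int]]  # one binary mask (rows of 0/1) per frame


def gate_masks(masks: List[Mask], existence_score: float,
               threshold: float = 0.5) -> List[Mask]:
    """Keep the masks if the target is judged present; otherwise blank them.

    The threshold value is an assumption, not taken from the report.
    """
    if existence_score >= threshold:
        return masks
    # Target judged absent: return all-zero masks of the same shape.
    return [[[0] * len(row) for row in mask] for mask in masks]


def segment_from_audio(audio: bytes,
                       frames: List[str],
                       transcribe: Callable[[bytes], str],
                       segment: Callable[[List[str], str], List[Mask]],
                       score_existence: Callable[[List[str], str], float],
                       threshold: float = 0.5) -> List[Mask]:
    """Audio -> text (ASR) -> text-conditioned masks -> existence gate."""
    query = transcribe(audio)               # ASR step
    masks = segment(frames, query)          # text-based RVOS step
    score = score_existence(frames, query)  # estimate of target presence
    return gate_masks(masks, score, threshold)


# Toy stand-ins so the sketch runs end to end (not real models).
def fake_asr(audio: bytes) -> str:
    return "the dog running to the left"


def fake_rvos(frames: List[str], query: str) -> List[Mask]:
    return [[[1, 0], [0, 1]] for _ in frames]


def fake_existence(frames: List[str], query: str) -> float:
    return 0.9 if "dog" in query else 0.1


frames = ["frame0", "frame1"]
present = segment_from_audio(b"", frames, fake_asr, fake_rvos, fake_existence)
absent = segment_from_audio(b"", frames, lambda a: "the cat sleeping",
                            fake_rvos, fake_existence)
```

In this toy run, `present` keeps the segmenter's masks, while `absent` is blanked because the stand-in existence score falls below the threshold. Passing the three stages in as callables keeps the gate independent of any particular ASR or segmentation model.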
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Video Analysis and Summarization
