3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio

Jihwan Hong; Jaeyoung Do

arXiv:2603.23126·cs.CV·March 25, 2026



TL;DR

VIRST-Audio is a framework for audio-based referring video object segmentation. It transcribes the audio query to text with an ASR module, segments with a pretrained text-based RVOS model, and adds an existence-aware gating mechanism for robustness. It placed 3rd in the MeViS-Audio track of the 5th PVUW challenge.

Contribution

The report introduces an approach that applies text-based reasoning to audio-driven segmentation without audio-specific training, and improves robustness with an existence-aware gating mechanism that suppresses masks when the referred object is absent from the video.

Findings

01. Achieved 3rd place in the MeViS-Audio track of the 5th PVUW challenge

02. Demonstrated effective transfer from text-based reasoning to audio-driven scenarios

03. Improved segmentation stability with the existence-aware gating mechanism

Abstract

Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on…
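The pipeline described in the abstract (audio to text via ASR, text-conditioned segmentation, then existence-aware gating) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the callable interfaces, the scalar existence score, and the 0.5 threshold are all assumptions made for the example, and the three "models" are toy stand-ins so that the code runs end to end.

```python
from typing import Callable, List

Frame = List[List[int]]  # toy stand-in for an image: rows of pixel values
Mask = List[List[int]]   # binary mask with the same height and width as a frame


def suppress_if_absent(masks: List[Mask], existence_score: float,
                       threshold: float = 0.5) -> List[Mask]:
    """Keep the predicted masks if the target is judged present; otherwise blank them.

    The scalar score and the threshold are assumptions for this sketch.
    """
    if existence_score >= threshold:
        return masks
    return [[[0 for _ in row] for row in mask] for mask in masks]


def segment_from_audio(audio: bytes, frames: List[Frame],
                       transcribe: Callable[[bytes], str],
                       segment: Callable[[List[Frame], str], List[Mask]],
                       score_existence: Callable[[List[Frame], str], float],
                       threshold: float = 0.5) -> List[Mask]:
    """Audio -> text via ASR, text-conditioned segmentation, then existence gating."""
    query = transcribe(audio)                # ASR step
    masks = segment(frames, query)           # text-based RVOS step
    score = score_existence(frames, query)   # is the referred object in the video?
    return suppress_if_absent(masks, score, threshold)


# Toy stand-ins (hypothetical, not real models).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_rvos(frames: List[Frame], query: str) -> List[Mask]:
    # Marks every nonzero pixel as foreground, regardless of the query.
    return [[[1 if px else 0 for px in row] for row in frame] for frame in frames]


def fake_existence(frames: List[Frame], query: str) -> float:
    # Pretends that only "the dog" appears in the video.
    return 0.9 if query == "the dog" else 0.1


frames = [[[0, 5], [7, 0]], [[3, 0], [0, 0]]]
present = segment_from_audio(b"the dog", frames, fake_asr, fake_rvos, fake_existence)
absent = segment_from_audio(b"the cat", frames, fake_asr, fake_rvos, fake_existence)
```

In this toy run, `present` keeps the masks produced by the segmenter, while `absent` comes back as all-zero masks of the same shape, which is the behavior the gating step is meant to provide when the referred object does not appear in the video.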


Taxonomy

Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Video Analysis and Summarization