2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
Zhiyu Wang, Xudong Kang, Shutao Li

TL;DR
This paper introduces ASR-SaSaSa2VA, a resource-efficient audio-guided video segmentation framework that converts audio to text for improved robustness and leverages pre-trained models for pixel-level object segmentation.
Contribution
The paper proposes a novel method combining speech recognition and text-based segmentation models to improve efficiency and robustness in audio-guided video object segmentation.
Findings
Achieved a score of 80.7 in the MeViS-Audio track of the 5th PVUW Challenge, ranking second.
Effectively filters irrelevant audio using a fine-tuned MLLM.
Utilizes pre-trained text-based segmentation models for pixel-level predictions.
Abstract
Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate…
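The cascade described in the abstract (ASR transcription, a relevance check, then text-based referring segmentation) can be sketched as a minimal Python pipeline. This is an illustrative sketch, not the authors' implementation: the class `AudioGuidedSegmenter`, its three callables, and the toy stand-ins are hypothetical names. Returning empty masks for audio judged irrelevant is also an assumption, since the summary does not say how filtered inputs are handled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[List[int]]  # toy stand-in for an H x W image
Mask = List[List[int]]   # binary mask with the same H x W shape


def empty_mask(frame: Frame) -> Mask:
    """All-zero mask matching the frame's height and width."""
    return [[0] * len(row) for row in frame]


@dataclass
class AudioGuidedSegmenter:
    # All three callables are hypothetical placeholders, not real model APIs.
    transcribe: Callable[[bytes], str]                     # ASR: audio -> text
    is_relevant: Callable[[str, Sequence[Frame]], bool]    # MLLM-style filter
    segment: Callable[[Sequence[Frame], str], List[Mask]]  # text-based segmenter

    def __call__(self, audio: bytes, frames: Sequence[Frame]) -> List[Mask]:
        text = self.transcribe(audio).strip()
        # Assumption: audio judged irrelevant yields empty masks for every frame.
        if not text or not self.is_relevant(text, frames):
            return [empty_mask(f) for f in frames]
        return self.segment(frames, text)


# Toy stand-ins so the sketch runs end to end without any model weights.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_filter(text: str, frames: Sequence[Frame]) -> bool:
    return "dog" in text.lower()


def toy_segment(frames: Sequence[Frame], text: str) -> List[Mask]:
    return [[[1] * len(row) for row in f] for f in frames]


pipeline = AudioGuidedSegmenter(toy_asr, toy_filter, toy_segment)
frames = [[[0, 0], [0, 0]] for _ in range(3)]
masks_hit = pipeline(b"the dog running left", frames)
masks_miss = pipeline(b"background music", frames)
```

Keeping each stage behind a plain callable reflects the resource-efficiency argument: the ASR model and the pre-trained segmenter can be swapped or upgraded independently, and no joint audio-video training is needed.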
