2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

Zhiyu Wang; Xudong Kang; Shutao Li

arXiv:2604.23935·cs.CV·April 28, 2026

TL;DR

This paper introduces ASR-SaSaSa2VA, a resource-efficient audio-guided video segmentation framework that converts audio to text for improved robustness and leverages pre-trained models for pixel-level object segmentation.

Contribution

The paper proposes a novel method combining speech recognition and text-based segmentation models to improve efficiency and robustness in audio-guided video object segmentation.

Findings

01. Achieved a score of 80.7 in the PVUW Challenge MeViS-Audio track, ranking second.

02. Filters out irrelevant audio using a fine-tuned MLLM.

03. Uses pre-trained text-based referring video segmentation models for pixel-level predictions.

Abstract

Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate…
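The cascade described in the abstract (ASR, then a relevance filter, then text-based referring segmentation) can be sketched as below. This is a minimal illustration and not the authors' implementation: the class `AudioGuidedSegmenter`, its three callables, and the stub functions are all hypothetical names, and the stubs stand in for a real ASR model, the fine-tuned MLLM filter, and a model such as SaSaSa2VA. Two points are assumptions rather than facts from the source: that the filter sees the video frames as well as the transcript, and that audio judged irrelevant yields all-zero masks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical minimal types: a frame and a mask are 2D lists of ints.
Frame = List[List[int]]
Mask = List[List[int]]


def empty_mask(frame: Frame) -> Mask:
    """Return an all-zero mask with the same height and width as the frame."""
    return [[0 for _ in row] for row in frame]


@dataclass
class AudioGuidedSegmenter:
    """Cascade: audio -> text (ASR) -> relevance filter -> text-based segmentation."""
    transcribe: Callable[[bytes], str]                      # stands in for an ASR model
    is_relevant: Callable[[str, Sequence[Frame]], bool]     # stands in for the MLLM filter
    segment: Callable[[Sequence[Frame], str], List[Mask]]   # stands in for the segmenter

    def __call__(self, audio: bytes, frames: Sequence[Frame]) -> List[Mask]:
        text = self.transcribe(audio).strip()
        # Assumption: an empty or irrelevant transcript produces empty masks.
        if not text or not self.is_relevant(text, frames):
            return [empty_mask(f) for f in frames]
        return self.segment(frames, text)


# Stub components so the sketch runs without any model weights.
def fake_asr(audio: bytes) -> str:
    # Pretend the audio bytes already contain the spoken words.
    return audio.decode("utf-8")


def fake_filter(text: str, frames: Sequence[Frame]) -> bool:
    # Toy rule: any transcript mentioning "noise" is treated as irrelevant.
    return "noise" not in text.lower()


def fake_segmenter(frames: Sequence[Frame], text: str) -> List[Mask]:
    # Toy segmenter: marks every pixel as belonging to the target.
    return [[[1 for _ in row] for row in f] for f in frames]


pipeline = AudioGuidedSegmenter(fake_asr, fake_filter, fake_segmenter)
```

Because the three stages are passed in as plain callables, any one of them can be replaced without touching the others, which reflects the abstract's point that the segmentation model is reused as-is rather than retrained on paired audio-video data.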
