Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models

Seung-jae Lee; Paul Hongsuck Seo

arXiv:2506.06537·cs.CV·June 10, 2025

Open Access

TL;DR

This paper introduces a zero-shot audiovisual segmentation framework that leverages pretrained models across audio, vision, and text modalities to accurately identify sound sources in videos without requiring task-specific training data.

Contribution

The paper presents a method that connects pretrained models across audio, vision, and text, enabling zero-shot AVS without annotated datasets, in contrast to traditional supervised approaches that depend on pixel-level annotations.

Findings

01. Achieves state-of-the-art zero-shot AVS performance

02. Effectively bridges modality gaps using pretrained models

03. Demonstrates robustness across multiple datasets

Abstract

Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal…
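The abstract does not spell out how the pretrained models are connected, so the following is only a minimal sketch of one conceivable strategy, not the authors' method. It assumes text acts as the bridge: an audio clip is matched to a candidate label in a shared audio-text embedding space, and that label is then matched against per-pixel features in a shared vision-text embedding space. All function names, the cosine-similarity matching, and the threshold are hypothetical, and the embeddings are assumed to have been precomputed by whatever pretrained encoders are available.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def pick_label(audio_emb, label_embs_audio_space):
    """Return the index of the label most similar to the audio embedding.

    audio_emb: (Da,) embedding of the audio clip (assumed precomputed).
    label_embs_audio_space: (K, Da) text embeddings of K candidate labels
        in the same audio-text space (assumed precomputed).
    """
    sims = l2_normalize(label_embs_audio_space) @ l2_normalize(audio_emb)
    return int(np.argmax(sims))


def segment(pixel_embs, label_emb_vision_space, threshold=0.5):
    """Return a boolean (H, W) mask of pixels similar to the label.

    pixel_embs: (H, W, Dv) per-pixel visual features (assumed precomputed).
    label_emb_vision_space: (Dv,) text embedding of the chosen label in the
        same vision-text space.
    threshold: hypothetical cosine-similarity cutoff.
    """
    sims = l2_normalize(pixel_embs) @ l2_normalize(label_emb_vision_space)
    return sims > threshold


def zero_shot_avs(audio_emb, label_embs_audio_space,
                  label_embs_vision_space, pixel_embs, threshold=0.5):
    """Chain audio -> text label -> pixel mask without any training."""
    idx = pick_label(audio_emb, label_embs_audio_space)
    mask = segment(pixel_embs, label_embs_vision_space[idx], threshold)
    return idx, mask
```

In this sketch no component is trained on AVS data; the only task-specific choices are the candidate label set and the threshold, both of which are assumptions made for illustration.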

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration