Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models
Seung-jae Lee, Paul Hongsuck Seo

TL;DR
This paper introduces a zero-shot audiovisual segmentation framework that leverages pretrained models across audio, vision, and text modalities to accurately identify sound sources in videos without requiring task-specific training data.
Contribution
The paper presents a method that connects pretrained models across audio, vision, and text to perform zero-shot AVS without annotated datasets, in contrast to traditional supervised approaches that depend on costly pixel-level annotations.
Findings
Achieves state-of-the-art zero-shot AVS performance
Effectively bridges modality gaps using pretrained models
Demonstrates robustness across multiple datasets
Abstract
Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal…
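The abstract does not spell out how the pretrained models are connected, so the following is only a toy sketch of one plausible way text could bridge audio and vision, not the authors' pipeline. It assumes audio, text, and per-pixel visual embeddings already live in a shared space (in practice these would come from pretrained encoders). The function names, the two-dimensional embeddings, and the 0.5 threshold are all hypothetical and chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_text_prompt(audio_emb, label_embs):
    """Hypothetical bridge step: choose the text label closest to the audio embedding."""
    return max(label_embs, key=lambda name: cosine(audio_emb, label_embs[name]))

def segment_by_prompt(pixel_embs, prompt_emb, threshold=0.5):
    """Hypothetical segmentation step: mark pixels whose embedding matches the prompt."""
    return [
        [1 if cosine(pixel, prompt_emb) > threshold else 0 for pixel in row]
        for row in pixel_embs
    ]

# Toy stand-ins for encoder outputs (assumed values, not real model outputs).
label_embs = {"dog": [1.0, 0.0], "guitar": [0.0, 1.0]}
audio_emb = [0.9, 0.1]
pixel_embs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.1, 0.9]],
]

label = pick_text_prompt(audio_emb, label_embs)        # audio -> text
mask = segment_by_prompt(pixel_embs, label_embs[label])  # text -> vision
print(label, mask)
```

No AVS-specific training appears anywhere in this sketch: the audio clip is mapped to a text label, and the label then acts as the query for the visual side. That is the sense in which the example is zero-shot.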
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration
