OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann

TL;DR
OpenAVS introduces a training-free, open-vocabulary audio-visual segmentation method leveraging foundation models and text prompts, enabling effective generalization to unseen scenarios.
Contribution
OpenAVS is the first method to align audio and visual modalities using text as a proxy, which enables training-free, open-vocabulary audio-visual segmentation (AVS) with a flexible architecture built on foundation models.
Findings
Outperforms existing unsupervised, zero-shot, and few-shot AVS methods.
Achieves improvements of approximately 9.4% in mIoU and 10.9% in F-score.
Demonstrates effectiveness on three benchmark datasets.
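For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how mIoU and F-score are typically computed for binary segmentation masks. It is not taken from the paper: the function names, the flattened 0/1 mask representation, and the default weighting `beta2=0.3` are assumptions made for illustration, and the summary does not state which F-score weighting the authors use.

```python
from typing import List, Sequence


def iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: List[Sequence[int]], gts: List[Sequence[int]]) -> float:
    """mIoU: the per-frame IoU averaged over all frames."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def f_score(pred: Sequence[int], gt: Sequence[int], beta2: float = 0.3) -> float:
    """Weighted F-score; beta2 = 0.3 is an assumed default, not from the source."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / sum(1 for p in pred if p)
    recall = tp / sum(1 for g in gt if g)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


pred = [1, 1, 0, 0]
gt = [1, 0, 1, 0]
print(iou(pred, gt))      # one shared pixel out of three in the union
print(f_score(pred, gt))  # precision and recall are both 0.5 here
```

Because precision and recall are equal in this toy case, the F-score equals that shared value regardless of the weighting chosen.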
Abstract
Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their…
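The three stages listed in the abstract can be read as a simple composition of functions. The sketch below only illustrates that data flow and is not the authors' implementation: the class name `TextProxyAVS`, its field names, and the three stub functions are hypothetical placeholders standing in for whatever audio-language model, LLM, and text-promptable segmentation model one chooses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[List[int]]  # toy image: rows of pixel values
Mask = List[List[int]]   # binary mask: rows of 0/1


@dataclass
class TextProxyAVS:
    """Hypothetical wrapper chaining the three stages described in the abstract."""
    audio_to_text: Callable[[Sequence[float]], List[str]]    # stage 1
    translate_prompts: Callable[[List[str]], List[str]]      # stage 2
    segment_by_text: Callable[[Frame, List[str]], Mask]      # stage 3

    def predict(self, audio: Sequence[float], frame: Frame) -> Mask:
        descriptions = self.audio_to_text(audio)         # audio -> text descriptions
        prompts = self.translate_prompts(descriptions)   # descriptions -> object prompts
        return self.segment_by_text(frame, prompts)      # prompts + frame -> mask


# Stubs (assumptions for illustration only; real models would replace these).
def stub_audio_tagger(audio: Sequence[float]) -> List[str]:
    return ["dog barking"]


def stub_translator(descriptions: List[str]) -> List[str]:
    # Keep only the leading noun as the visual prompt, e.g. "dog".
    return [d.split()[0] for d in descriptions]


def stub_segmenter(frame: Frame, prompts: List[str]) -> Mask:
    # Mark every pixel as foreground when at least one prompt exists.
    return [[1 if prompts else 0 for _ in row] for row in frame]


pipeline = TextProxyAVS(stub_audio_tagger, stub_translator, stub_segmenter)
mask = pipeline.predict(audio=[0.0, 0.1], frame=[[0, 0], [0, 0]])
print(mask)
```

Since no stage is trained jointly with the others, each callable can be swapped for a different foundation model without touching the rest, which is the flexibility the abstract emphasizes.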
