Open-Vocabulary Audio-Visual Semantic Segmentation
Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying

TL;DR
This paper introduces the first open-vocabulary audio-visual semantic segmentation framework, OV-AVSS, capable of recognizing both seen and unseen categories in videos by combining audio-visual fusion with large-scale pre-trained models.
Contribution
It extends AVSS to open-world scenarios, proposing a novel framework with a sound source localization module and an open-vocabulary classification module.
Findings
Achieves 55.43% mIoU on base categories
Achieves 29.14% mIoU on novel categories
Outperforms state-of-the-art zero-shot and open-vocabulary methods
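The mIoU figures above are reported separately for base (seen) and novel (unseen) categories. As a minimal sketch of how such a split can be computed, the snippet below averages per-class intersection-over-union over two disjoint class lists. The function names, the toy label maps, and the class split are hypothetical and chosen for illustration; the paper's actual evaluation protocol (for example, whether intersections and unions are accumulated over the whole dataset) is not described in this summary.

```python
# Illustrative only: per-class IoU on flattened label maps, averaged
# separately over hypothetical "base" and "novel" class lists.

def per_class_iou(pred, target, class_ids):
    """Return {class_id: IoU} for every class present in pred or target."""
    ious = {}
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious[c] = inter / union
    return ious


def mean_iou(pred, target, class_ids):
    """Mean of the per-class IoUs; 0.0 if none of the classes occur."""
    ious = per_class_iou(pred, target, class_ids)
    return sum(ious.values()) / len(ious) if ious else 0.0


# Toy example: 8 pixels, class 0 is background (assumed, not from the paper).
pred = [0, 1, 1, 2, 2, 2, 3, 0]
target = [0, 1, 2, 2, 2, 2, 3, 3]

base_classes = [1, 2]   # hypothetical classes seen during training
novel_classes = [3]     # hypothetical class unseen during training

base_miou = mean_iou(pred, target, base_classes)
novel_miou = mean_iou(pred, target, novel_classes)
print(f"base mIoU = {base_miou:.3f}, novel mIoU = {novel_miou:.3f}")
```

Reporting the two averages separately, rather than a single mIoU over all classes, keeps strong performance on base categories from masking weaker generalization to novel ones.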
Abstract
Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary…
Taxonomy
Topics: Subtitles and Audiovisual Media
Methods: Balanced Selection
