Open-Vocabulary Audio-Visual Semantic Segmentation

Ruohao Guo; Liao Qu; Dantong Niu; Yanyu Qi; Wenzhen Yue; Ji Shi; Bowei Xing; Xianghua Ying

arXiv:2407.21721·cs.MM·August 1, 2024·1 cites

TL;DR

This paper introduces the first open-vocabulary audio-visual semantic segmentation framework, OV-AVSS, capable of recognizing both seen and unseen categories in videos by combining audio-visual fusion with large-scale pre-trained models.

Contribution

It extends AVSS to open-world scenarios, proposing a novel framework with a sound source localization module and an open-vocabulary classification module.
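As a rough illustration of what an open-vocabulary classification step can look like, the sketch below assigns a label to one localized mask by comparing its embedding with text embeddings of candidate category names. This is a generic pattern under stated assumptions, not the paper's module: the use of cosine similarity, the function names `cosine` and `classify_mask`, and the three-dimensional toy vectors are all hypothetical stand-ins for features that a pre-trained vision-language model would supply.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_mask(mask_embedding, text_embeddings):
    """Return the best-matching category name and all similarity scores.

    text_embeddings maps a category name to its embedding. Because the
    vocabulary is just a dictionary, categories unseen in training can be
    added at inference time without retraining (assumption of this sketch).
    """
    scores = {name: cosine(mask_embedding, emb)
              for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Toy, made-up embeddings; a real system would obtain these from a text encoder.
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "guitar": [0.0, 1.0, 0.0],
    "helicopter": [0.0, 0.0, 1.0],
}
mask_embedding = [0.1, 0.9, 0.2]  # hypothetical feature of one sounding-object mask

label, scores = classify_mask(mask_embedding, text_embeddings)
print(label)
```

Keeping the vocabulary as data rather than as a fixed classifier head is what lets such a design name novel categories: extending the label space only requires embedding a new category name.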

Findings

01. Achieves 55.43% mIoU on base categories
02. Achieves 29.14% mIoU on novel categories
03. Outperforms state-of-the-art zero-shot and open-vocabulary methods
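For readers unfamiliar with the metric, the following minimal sketch shows how mIoU (mean intersection over union) can be reported separately for base and novel categories. It assumes predictions and ground truth are flat lists of integer class ids and that the class ids have been split into two groups; the function names and the tiny example are hypothetical, and the paper's own evaluation code may accumulate statistics differently (for example, over a whole dataset rather than a single frame).

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class; None for classes absent from both pred and target."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counts once toward both intersection and union of class p.
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    return [i / u if u else None for i, u in zip(inter, union)]

def mean_iou(ious, class_ids):
    """Average IoU over the given class ids, skipping absent classes."""
    vals = [ious[c] for c in class_ids if ious[c] is not None]
    return sum(vals) / len(vals) if vals else float("nan")

# Toy 4-pixel example with three classes; class 2 plays the "novel" role.
pred = [0, 1, 2, 2]
target = [0, 1, 1, 2]

ious = per_class_iou(pred, target, num_classes=3)
base_miou = mean_iou(ious, class_ids=[0, 1])
novel_miou = mean_iou(ious, class_ids=[2])
print(base_miou, novel_miou)
```

Reporting the two averages separately makes the generalization gap visible: a single overall mIoU would hide how much performance drops on categories never seen during training.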

Abstract

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos using acoustic cues. However, most approaches operate under the closed-set assumption and only identify pre-defined categories from the training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, which extends the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have been neither seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary…

Code & Models

Repositories

ruohaoguo/ovavss
pytorch

Models

🤗
ruohguo/ovavss
model

Taxonomy

Topics: Subtitles and Audiovisual Media

Methods: Balanced Selection