BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, and Xin Yu

arXiv:2308.10175·cs.CV·August 22, 2023

TL;DR

This paper introduces BAVS, a two-stage framework that improves audio-visual segmentation by integrating foundation knowledge to effectively distinguish real sound sources from background noise and off-screen sounds.

Contribution

The proposed BAVS method explicitly models audio-visual correspondences and incorporates a hierarchical semantic integration strategy to enhance sound source localization in noisy environments.
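As a toy illustration of what making audio-visual correspondences explicit can mean, the sketch below keeps only image segments whose label is also detected in the audio. This is not the paper's pipeline: the function name `select_sounding_mask`, the label strings, the score dictionary, and the 0.5 threshold are all hypothetical, chosen only to show the idea.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1 (hypothetical representation)


def select_sounding_mask(
    segments: List[Tuple[str, Mask]],
    audio_scores: Dict[str, float],
    height: int,
    width: int,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Mask:
    """Union the masks of segments whose label is also heard in the audio."""
    # Labels the (hypothetical) audio tagger considers present.
    heard = {label for label, score in audio_scores.items() if score >= threshold}

    out = [[0] * width for _ in range(height)]
    for label, mask in segments:
        if label not in heard:
            continue  # visible but silent object: excluded from the output
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    out[r][c] = 1
    # Audio labels with no matching segment (off-screen sounds) never
    # reach the loop body, so they contribute nothing to the mask.
    return out


# A dog and a guitar are visible; the audio contains a dog and off-screen speech.
segments = [
    ("dog", [[1, 1, 0], [0, 0, 0]]),
    ("guitar", [[0, 0, 0], [0, 1, 1]]),
]
audio_scores = {"dog": 0.9, "speech": 0.8, "guitar": 0.1}
result = select_sounding_mask(segments, audio_scores, height=2, width=3)
print(result)
```

In this example only the dog's pixels are kept: the guitar is visible but silent, and the speech has no visual counterpart in the frame.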

Findings

01 Outperforms existing methods on AVS datasets

02 Effectively handles background noise and off-screen sounds

03 Improves sound localization accuracy in real-world scenarios

Abstract

Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate the audio recordings in real-world scenarios. They impose significant challenges on building a consistent semantic mapping between audio and visual signals for AVS models and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework by incorporating multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing the audio-visual correspondences in an explicit manner. In the first…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation