High-Quality Visually-Guided Sound Separation from Diverse Categories

Chao Huang; Susan Liang; Yapeng Tian; Anurag Kumar; Chenliang Xu

arXiv:2308.00122 · cs.CV · October 14, 2024 · 2 cites

Open Access

TL;DR

DAVIS introduces a diffusion-based generative framework for high-quality audio-visual sound separation, outperforming existing methods by synthesizing sounds directly conditioned on visual and audio cues.

Contribution

The paper presents DAVIS, a novel diffusion model approach for audio-visual separation that surpasses traditional mask-based methods in quality and diversity.

Findings

01 Outperforms state-of-the-art methods on the AVE and MUSIC datasets.

02 Generates higher-quality separated sounds across diverse sound categories.

03 Uses a diffusion model conditioned on both the audio mixture and visual information.

Abstract

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse sound categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and…
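The generative procedure described in the abstract (start from Gaussian noise, then denoise step by step while conditioning on the audio mixture and the visual information) can be illustrated with a minimal sketch. It assumes a standard DDPM-style ancestral sampler with a linear noise schedule, which the abstract does not specify. The names `make_schedule`, `toy_denoiser`, and `sample_separated` are hypothetical, and `toy_denoiser` is only a placeholder for the learned Separation U-Net, not the paper's model.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the paper's schedule may differ)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, mixture, visual):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would be a U-Net taking the noisy spectrogram, the timestep,
    the mixture spectrogram, and a visual feature vector as inputs.
    """
    return 0.1 * (x_t - mixture * visual.mean())

def sample_separated(mixture, visual, denoiser, num_steps=50, seed=0):
    """Reverse diffusion: Gaussian noise -> sample, conditioned at every step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(mixture.shape)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = denoiser(x, t, mixture, visual)  # predicted noise
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add noise on all but the final step (variance choice: beta_t).
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

if __name__ == "__main__":
    mixture = np.abs(np.random.default_rng(1).standard_normal((64, 32)))  # fake magnitude spectrogram
    visual = np.full(128, 0.5)                                            # fake visual embedding
    separated = sample_separated(mixture, visual, toy_denoiser)
    print(separated.shape)
```

The point of the sketch is the contrast with mask-based regression: instead of predicting a mask that is multiplied with the mixture, the output is sampled iteratively, with the mixture and visual features entering only as conditioning inputs to the denoiser.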

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: Concatenated Skip Connection · Convolution · Diffusion · Max Pooling · U-Net