High-Quality Visually-Guided Sound Separation from Diverse Categories
Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

TL;DR
DAVIS is a diffusion-based generative framework for high-quality audio-visual sound separation. It synthesizes separated sounds directly, conditioned on the audio mixture and visual information, and outperforms existing methods.
Contribution
The paper presents DAVIS, a diffusion-based approach to audio-visual sound separation that replaces mask-based regression with a generative objective, targeting higher-quality separation across diverse sound categories.
Findings
Outperforms state-of-the-art discriminative methods on the AVE and MUSIC datasets.
Generates higher-quality separated sounds across diverse sound categories.
Uses a diffusion model and a Separation U-Net conditioned on both the audio mixture and visual information.
Abstract
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse sound categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and…
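The generative formulation described in the abstract (start from Gaussian noise, then iteratively denoise while conditioning on the audio mixture and visual information) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the DDPM-style ancestral sampler, the linear noise schedule, the array shapes, and the function names. In particular, `toy_noise_predictor` is a hypothetical stand-in for the Separation U-Net; it is an oracle that pretends the clean source is a fixed function of the two conditions, so the loop can run without a trained network.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Linear noise schedule (an assumption; the source does not specify one).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def toy_target(mixture, visual_feat):
    # Hypothetical "clean source": the mixture scaled per frequency bin by a
    # gain derived from the visual feature. Purely illustrative.
    gain = 1.0 / (1.0 + np.exp(-visual_feat))
    return mixture * gain[:, None]

def toy_noise_predictor(x_t, t, mixture, visual_feat, alpha_bars):
    # Stand-in for a learned, conditioned network. It returns the noise that
    # would turn toy_target(...) into x_t under the forward diffusion process.
    x0 = toy_target(mixture, visual_feat)
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

def sample_separated(mixture, visual_feat, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise with the same shape as the mixture.
    x = rng.standard_normal(mixture.shape)
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t, mixture, visual_feat, alpha_bars)
        # DDPM posterior mean for the previous (less noisy) step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Toy inputs: an 8x16 magnitude "spectrogram" and an 8-dimensional visual feature.
rng = np.random.default_rng(1)
mixture = np.abs(rng.standard_normal((8, 16)))
visual_feat = rng.standard_normal(8)
separated = sample_separated(mixture, visual_feat)
```

Because the stand-in predictor is an oracle, the sampler lands on the toy target at the final step; with a trained network the output would instead be a sample from the learned conditional distribution. The point of the sketch is the control flow: the separated sound is generated step by step from noise rather than obtained by applying a predicted mask to the mixture.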
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
Methods: Concatenated Skip Connection · Convolution · Diffusion · Max Pooling · U-Net
