Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation
Yun-Ning (Amy) Hung, Richard Vogl, Filip Korzeniowski, Igor Pereira

TL;DR
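This paper introduces Diff-VS, a diffusion-based U-Net model for vocal separation that matches discriminative baselines on objective metrics and achieves perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. The authors hope the results encourage broader exploration of generative approaches to music source separation.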
Contribution
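The paper presents a generative vocal separation model built on the Elucidated Diffusion Model (EDM) framework. It processes complex short-time Fourier transform spectrograms with an improved U-Net architecture based on music-informed design choices, and achieves competitive results in vocal separation.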
Findings
Matches discriminative baselines on objective metrics
Achieves perceptual quality comparable to state-of-the-art systems
Encourages broader use of generative methods in music source separation
Abstract
While diffusion models are best known for their performance in generative tasks, they have also been successfully applied to many other tasks, including audio source separation. However, current generative approaches to music source separation often underperform on standard objective metrics. In this paper, we address this issue by introducing a novel generative vocal separation model based on the Elucidated Diffusion Model (EDM) framework. Our model processes complex short-time Fourier transform spectrograms and employs an improved U-Net architecture based on music-informed design choices. Our approach matches discriminative baselines on objective metrics and achieves perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. We hope these results encourage broader exploration of generative methods for music source separation.
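
To make the two ingredients named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows one common way to turn a complex STFT into a two-channel real/imaginary array that a U-Net could consume, and the standard input/output scaling coefficients from the general EDM formulation. The function names, the FFT size, the hop size, and the value `sigma_data=0.5` are assumptions chosen for illustration; the abstract does not specify Diff-VS's actual settings.

```python
import numpy as np


def stft_real_imag(x, n_fft=1024, hop=256):
    """Complex STFT of a mono signal, stacked as (real, imag) channels.

    Hypothetical helper: n_fft and hop are illustrative values only.
    Returns an array of shape (2, n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=-1)  # complex, (n_frames, n_bins)
    return np.stack([spec.real, spec.imag])


def edm_precond(sigma, sigma_data=0.5):
    """Scaling coefficients of the general EDM preconditioning.

    The denoiser is D(x, sigma) = c_skip * x + c_out * F(c_in * x, c_noise),
    where F is the raw network. sigma_data is an assumed data std.
    """
    total_var = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total_var
    c_out = sigma * sigma_data / np.sqrt(total_var)
    c_in = 1.0 / np.sqrt(total_var)
    c_noise = np.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(4096)      # stand-in for a mixture excerpt
    features = stft_real_imag(audio)
    print(features.shape)                  # (2, 13, 513)
    print(edm_precond(0.5))
```

Keeping the real and imaginary parts as separate channels preserves phase information, so a separated spectrogram can in principle be inverted back to a waveform without a separate phase-estimation step.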
