Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

Yun-Ning (Amy) Hung; Richard Vogl; Filip Korzeniowski; Igor Pereira

arXiv:2604.01120 · eess.AS · April 24, 2026

TL;DR

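Diff-VS is a diffusion-based U-Net for vocal separation, built on the Elucidated Diffusion Model (EDM) framework. It matches discriminative baselines on objective metrics and reaches perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. The authors present these results as a case for broader use of generative methods in music source separation.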

Contribution

The paper presents a generative vocal separation model built on the EDM framework. The model operates on complex short-time Fourier transform (STFT) spectrograms and uses an improved U-Net architecture based on music-informed design choices.

Findings

01 Matches discriminative baselines on objective metrics

02 Achieves perceptual quality comparable to state-of-the-art systems

03 Encourages broader use of generative methods in music source separation

Abstract

While diffusion models are best known for their performance in generative tasks, they have also been successfully applied to many other tasks, including audio source separation. However, current generative approaches to music source separation often underperform on standard objective metrics. In this paper, we address this issue by introducing a novel generative vocal separation model based on the Elucidated Diffusion Model (EDM) framework. Our model processes complex short-time Fourier transform spectrograms and employs an improved U-Net architecture based on music-informed design choices. Our approach matches discriminative baselines on objective metrics and achieves perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. We hope these results encourage broader exploration of generative methods for music source separation.
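For readers unfamiliar with the EDM framework named in the abstract, the sketch below shows the standard EDM preconditioning, which wraps a raw network F into a denoiser D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise). The coefficient formulas are those of the general EDM formulation. Everything else is an assumption made for illustration and is not taken from Diff-VS: the value of `SIGMA_DATA`, the function names, the toy representation of STFT bins as a flat list of Python complex numbers, and the omission of any conditioning on the mixture signal.

```python
import math

# Assumed data standard deviation (a common EDM default); not a value from Diff-VS.
SIGMA_DATA = 0.5


def edm_preconditioning(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for a noise level sigma > 0."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total               # weight of the noisy input
    c_out = sigma * sigma_data / math.sqrt(total)  # scale of the network output
    c_in = 1.0 / math.sqrt(total)                  # scale of the network input
    c_noise = math.log(sigma) / 4.0                # noise-level conditioning value
    return c_skip, c_out, c_in, c_noise


def denoise(raw_net, x_noisy, sigma, sigma_data=SIGMA_DATA):
    """Apply D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise).

    x_noisy is a flat list of complex STFT bins (toy layout, hypothetical).
    raw_net is any callable F(scaled_bins, c_noise) returning a list of the same length.
    """
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma, sigma_data)
    scaled = [c_in * v for v in x_noisy]
    net_out = raw_net(scaled, c_noise)
    return [c_skip * x + c_out * f for x, f in zip(x_noisy, net_out)]


def zero_net(bins, c_noise):
    """Placeholder standing in for a trained U-Net; always predicts zeros."""
    return [0j for _ in bins]


if __name__ == "__main__":
    noisy_bins = [1 + 2j, -0.5 + 0.25j]
    print(edm_preconditioning(0.5))
    print(denoise(zero_net, noisy_bins, sigma=0.5))
```

With the placeholder network, the denoiser reduces to scaling the input by c_skip, which makes the role of each coefficient easy to inspect before substituting a real model.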
