Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures
Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

TL;DR
This paper introduces a diffusion model that generates the solo singing voice conditioned on the corresponding music mixture. It improves upon prior generative systems, lets the user control the quality-efficiency trade-off through the iterative sampling process, and allows the user to refine the output.
Contribution
The work presents a diffusion-based method for singing voice separation that generates the solo vocals conditioned on the music mixture. It outperforms prior generative models and enables user-controlled refinement of the output.
Findings
Achieves competitive objective scores against non-generative baselines when trained with supplementary data.
Enables user control over the quality-efficiency trade-off.
Provides an ablation study on sampling algorithm parameters.
Abstract
Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also…
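The quality-efficiency trade-off mentioned in the abstract can be illustrated with a minimal sampling-loop sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `toy_denoiser` and `sample_vocals`, the geometric noise schedule, the deterministic update rule, and the stand-in denoiser (which simply pretends the vocals are half of the mixture). The only point it shows is that the sampler calls the mixture-conditioned model once per step, so the number of steps is the knob the user turns.

```python
import random


def toy_denoiser(noisy, sigma, mixture):
    """Hypothetical stand-in for a trained network that predicts clean vocals.

    A real system would use a neural network conditioned on the mixture.
    This toy version ignores `noisy` and `sigma` and pretends the vocals
    are half of the mixture, so the loop can run end to end.
    """
    return [0.5 * m for m in mixture]


def sample_vocals(mixture, num_steps, denoiser=toy_denoiser,
                  sigma_max=1.0, sigma_min=1e-3, seed=0):
    """Deterministic iterative sampler conditioned on the mixture (a sketch)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    rng = random.Random(seed)
    # Assumed schedule: geometric decay from sigma_max to sigma_min, then 0.
    sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / max(num_steps - 1, 1))
              for i in range(num_steps)] + [0.0]
    # Start from pure noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in mixture]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # One model call per step: this is where the compute cost goes.
        x0_hat = denoiser(x, sigma, mixture)
        # Move toward the predicted clean signal, keeping a share of the
        # residual equal to the ratio of the next and current noise levels.
        ratio = sigma_next / sigma
        x = [x0 + ratio * (xi - x0) for xi, x0 in zip(x, x0_hat)]
    return x


if __name__ == "__main__":
    mixture = [0.2, -0.4, 0.6]
    print(sample_vocals(mixture, num_steps=4))
```

Because the toy denoiser is already perfect, the output does not change with `num_steps`. With a learned model, more steps would typically be expected to improve the result at a proportionally higher compute cost, since each step costs one network evaluation.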
Taxonomy
Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
