ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang; Yuesheng Ma; Junxuan Huang; Susan Liang; Yunlong Tang; Jing Bi; Wenqiang Liu; Nima Mesgarani; Chenliang Xu

arXiv:2505.23625 · cs.SD · May 30, 2025

TL;DR

ZeroSep demonstrates that pre-trained text-guided audio diffusion models can perform zero-shot source separation without any task-specific training, effectively handling open-set scenarios and outperforming supervised methods.

Contribution

The paper introduces ZeroSep, a novel approach that repurposes pre-trained diffusion models for audio source separation in a zero-shot setting, eliminating the need for labeled data.

Findings

01. ZeroSep achieves competitive separation performance without training.

02. Supports open-set scenarios through rich textual priors.

03. Outperforms supervised methods on multiple benchmarks.

Abstract

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources.…
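
The invert-then-denoise idea described in the abstract can be illustrated with a minimal sketch. It assumes a DDIM-style deterministic inversion, which the abstract does not specify, and it replaces the pre-trained text-guided audio diffusion model with a hypothetical toy noise predictor (`toy_eps`) that returns a fixed, prompt-dependent vector. The function names, the noise schedule, and the empty inversion prompt are illustrative assumptions, not the authors' implementation.

```python
import zlib
import numpy as np

def alpha_bars(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative noise schedule; index 0 is the clean signal (alpha_bar = 1)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def toy_eps(x, t, prompt):
    """Hypothetical stand-in for a pre-trained text-conditioned noise predictor.
    It ignores x and t and returns a fixed direction derived from the prompt."""
    seed = zlib.crc32(prompt.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(x.shape)

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two noise levels."""
    x0 = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0 + np.sqrt(1.0 - a_to) * eps

def invert(x, prompt, a, eps_model=toy_eps):
    """Map a clean latent to a noisy latent (t = 0 -> T)."""
    for t in range(len(a) - 1):
        x = ddim_step(x, eps_model(x, t, prompt), a[t], a[t + 1])
    return x

def denoise(x, prompt, a, eps_model=toy_eps):
    """Map a noisy latent back to a clean latent under a text prompt (t = T -> 0)."""
    for t in range(len(a) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t, prompt), a[t], a[t - 1])
    return x

def separate(mixture_latent, target_prompt, a, inversion_prompt=""):
    """Invert the mixture, then denoise it conditioned on the target source."""
    noisy = invert(mixture_latent, inversion_prompt, a)
    return denoise(noisy, target_prompt, a)

a = alpha_bars()
mixture = np.random.default_rng(123).standard_normal(16)  # toy "mixture latent"
target = separate(mixture, "dog barking", a)
```

With this toy predictor, denoising under the same prompt used for inversion reproduces the mixture, while a different prompt steers the result elsewhere. In a real system the predictor would depend on the latent and the timestep, so inversion would only be approximate.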

Taxonomy

Methods: Diffusion