DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization
Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo

TL;DR
This paper introduces DGMO, a training-free, zero-shot audio source separation method that repurposes pretrained diffusion models through test-time mask optimization, eliminating the need for task-specific training.
Contribution
The paper presents a framework that applies pretrained diffusion models, originally built for audio generation, to zero-shot source separation without any additional training.
Findings
Achieves competitive separation performance without task-specific training.
Identifies limitations of naive diffusion model adaptations for audio separation.
Proposes a test-time mask optimization method that improves separation accuracy.
Abstract
Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the…
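To make the idea of test-time mask optimization concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not specify the loss, the optimizer, or how the diffusion model is queried. The sketch assumes that a reference magnitude spectrogram is already available (in DGMO it would come from a pretrained diffusion model; here a synthetic target stands in), and it fits a mask with values in [0, 1] so that the masked mixture approximates that reference. The function name `optimize_mask`, the squared-error loss, and the additive toy mixture are all hypothetical choices made for illustration.

```python
import numpy as np


def optimize_mask(mixture_mag, reference_mag, steps=500, lr=0.5):
    """Fit a [0, 1] mask so that mask * mixture_mag approximates reference_mag.

    Hypothetical helper: the squared-error loss and plain gradient descent
    are assumptions, not details taken from the paper.
    """
    # Parameterize the mask through a sigmoid so it always stays in (0, 1).
    logits = np.zeros_like(mixture_mag)
    for _ in range(steps):
        mask = 1.0 / (1.0 + np.exp(-logits))
        residual = mask * mixture_mag - reference_mag
        # Gradient of 0.5 * sum(residual ** 2) with respect to the logits.
        grad = residual * mixture_mag * mask * (1.0 - mask)
        logits -= lr * grad
    return 1.0 / (1.0 + np.exp(-logits))


# Toy data: random non-negative "spectrograms" (frequency bins x time frames).
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(8, 16))        # stand-in for the reference
interference = rng.uniform(0.0, 1.0, size=(8, 16))
mixture = target + interference                     # additive mixing is a simplification

mask = optimize_mask(mixture, target)
separated = mask * mixture

initial_error = np.mean((0.5 * mixture - target) ** 2)  # error of the initial 0.5 mask
final_error = np.mean((separated - target) ** 2)
print(f"initial error: {initial_error:.4f}, final error: {final_error:.4f}")
```

Because the output is always a mask applied to the input mixture, the separated signal cannot contain energy that was absent from the input, which is one way to read the abstract's phrase "input-aligned separation".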
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Diffusion
