DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization

Geonyoung Lee; Geonhee Han; Paul Hongsuck Seo

arXiv:2506.02858 · eess.AS · June 26, 2025

Open Access

TL;DR

This paper introduces DGMO, a training-free, zero-shot audio source separation method that repurposes pretrained diffusion models through test-time mask optimization, eliminating the need for task-specific training.

Contribution

The paper presents a novel framework that leverages pretrained diffusion models for audio separation without additional training, expanding their application to zero-shot source separation.

Findings

1. Achieves competitive separation performance without task-specific training.

2. Identifies limitations of naive diffusion model adaptations for audio separation.

3. Proposes a test-time mask optimization method that improves separation accuracy.

Abstract

Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the…
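
The idea of refining a spectrogram mask at test time can be illustrated with a toy sketch. This is not the authors' implementation: the reference spectrogram here is a synthetic stand-in for what a pretrained diffusion model would generate from the text query, and the function name `optimize_mask`, the squared-error loss, the step count, and the learning rate are all assumptions chosen for illustration. The sketch only shows why a mask bounded in [0, 1] and applied to the mixture keeps the output aligned with the input.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def optimize_mask(mix_mag, ref_mag, steps=500, lr=0.5):
    """Fit a mask in [0, 1] so that mask * mix_mag approximates ref_mag.

    Hypothetical helper: plain gradient descent on 0.5 * (mask * mix - ref)^2,
    with the mask parameterized as sigmoid(logits).
    """
    logits = np.zeros_like(mix_mag)  # mask starts at 0.5 everywhere
    for _ in range(steps):
        mask = sigmoid(logits)
        residual = mask * mix_mag - ref_mag
        # Chain rule through the sigmoid: d(mask)/d(logits) = mask * (1 - mask)
        grad = residual * mix_mag * mask * (1.0 - mask)
        logits -= lr * grad
    return sigmoid(logits)


# Synthetic magnitude "spectrograms" (8 frequency bins x 16 frames).
rng = np.random.default_rng(0)
target = rng.uniform(0.5, 1.0, size=(8, 16))
interference = rng.uniform(0.5, 1.0, size=(8, 16))
mix_mag = target + interference

# ASSUMPTION: in the real method a diffusion model would supply the reference;
# here the clean target is used directly as a stand-in.
ref_mag = target

mask = optimize_mask(mix_mag, ref_mag)
separated = mask * mix_mag

initial_error = np.mean((0.5 * mix_mag - ref_mag) ** 2)
final_error = np.mean((separated - ref_mag) ** 2)
print(f"error before: {initial_error:.4f}, after: {final_error:.2e}")
```

Because the output is always the mixture scaled by a mask, the separated signal cannot contain energy that was absent from the input, which is the sense in which mask optimization stays input-aligned in this toy setting.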

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion