Training-Free Multi-Step Audio Source Separation
Yongyi Zang, Jingyi Li, Qiuqiang Kong

TL;DR
This paper introduces a training-free multi-step inference method for audio source separation that iteratively improves results by blending previous outputs, outperforming traditional one-step models without additional training.
Contribution
The authors propose a novel inference technique that leverages existing pretrained models for multi-step separation, providing theoretical guarantees and empirical improvements across tasks.
Findings
Consistently outperforms one-step inference in speech and music separation
Achieves performance comparable to training larger models or using more data
Also improves metrics other than the one being optimized
Abstract
Audio source separation aims to separate a mixture into target sources. Previous audio source separation systems usually conduct one-step inference, which does not fully explore the separation ability of models. In this work, we reveal that pretrained one-step audio source separation models can be leveraged for multi-step separation without additional training. We propose a simple yet effective inference method that iteratively applies separation by optimally blending the input mixture with the previous step's separation result. At each step, we determine the optimal blending ratio by maximizing a metric. We prove that our method always yields improvement over one-step inference, provide error bounds based on model smoothness and metric robustness, and provide theoretical analysis connecting our method to denoising along linear interpolation paths between noise and clean distributions, a…
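The iterative blending described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `multi_step_separate` and `si_sdr`, the `separate` and `score` callables, the grid search over blending ratios, and the default of three steps are all assumptions made for illustration. The abstract does not specify which metric is maximized or how the ratio is searched, so the metric is passed in as a callable.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB, used here only as an example metric."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def multi_step_separate(separate, mixture, score, num_steps=3, ratios=None):
    """Hypothetical multi-step inference loop around a pretrained one-step model.

    separate: callable mapping a 1-D waveform to a separated 1-D waveform
              (stands in for the pretrained model; an assumption).
    score:    callable mapping an estimate to a scalar, higher is better
              (stands in for the metric being maximized; an assumption).
    """
    if ratios is None:
        # Assumed search strategy: a coarse grid over blending ratios.
        ratios = np.linspace(0.0, 1.0, 11)

    # Step 1 is ordinary one-step inference on the raw mixture.
    best_est = separate(mixture)
    best_score = score(best_est)
    history = [best_score]

    for _ in range(num_steps - 1):
        prev = best_est
        for a in ratios:
            # Blend the input mixture with the previous step's result.
            blended = (1.0 - a) * mixture + a * prev
            cand = separate(blended)
            s = score(cand)
            # Keep a candidate only if it beats the best result so far,
            # so the tracked score never drops below one-step inference.
            if s > best_score:
                best_est, best_score = cand, s
        history.append(best_score)

    return best_est, history
```

In this sketch, the non-decreasing score comes from keeping the best estimate seen so far. It is a property of the loop as written, not a restatement of the paper's proof. A toy usage might pass a smoothing filter as `separate` and `lambda e: si_sdr(e, target)` as `score`.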
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Diffusion
