Training-Free Multi-Step Inference for Target Speaker Extraction
Zhenghai You, Ying Shi, Lantian Li, Dong Wang

TL;DR
This paper introduces a training-free multi-step inference method for target speaker extraction that iteratively refines the estimates of a frozen pretrained model and balances multiple metrics for practical use.
Contribution
It proposes a novel multi-step inference approach that refines speaker extraction iteratively using a frozen pretrained model, with joint metric optimization for better practical control.
Findings
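With ground-truth target speech available, optimizing an intrusive metric (SI-SDRi) during iterative refinement yields consistent gains across multiple evaluation metrics.
Without ground truth, optimizing a non-intrusive metric (UTMOS or SpkSim) improves that metric but may hurt others.
Joint metric optimization balances intrusive and non-intrusive metrics.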
Abstract
Target speaker extraction (TSE) aims to recover a target speaker's speech from a mixture using a reference utterance as a cue. Most TSE systems adopt conditional auto-encoder architectures with one-step inference. Inspired by test-time scaling, we propose a training-free multi-step inference method that enables iterative refinement with a frozen pretrained model. At each step, new candidates are generated by interpolating the original mixture and the previous estimate, and the best candidate is selected for further refinement until convergence. Experiments show that, when ground-truth target speech is available, optimizing an intrusive metric (SI-SDRi) yields consistent gains across multiple evaluation metrics. Without ground truth, optimizing non-intrusive metrics (UTMOS or SpkSim) improves the corresponding metric but may hurt others. We therefore introduce joint metric optimization…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
