Training-Free Multi-Step Inference for Target Speaker Extraction

Zhenghai You; Ying Shi; Lantian Li; Dong Wang

arXiv:2603.10921 · cs.SD · March 12, 2026

Open Access

TL;DR

This paper introduces a training-free, multi-step inference method for target speaker extraction that iteratively refines the output of a frozen pretrained model and balances multiple metrics for practical use.

Contribution

The paper proposes a multi-step inference approach that iteratively refines the extracted speech using a frozen pretrained model, with joint metric optimization for better practical control.

Findings

01

Iterative refinement with a frozen pretrained model improves extraction quality when ground-truth target speech is available to guide candidate selection.

02

Joint metric optimization balances intrusive and non-intrusive metrics effectively.

03

With ground truth available, optimizing the intrusive metric SI-SDRi yields consistent gains across multiple evaluation metrics.

Abstract

Target speaker extraction (TSE) aims to recover a target speaker's speech from a mixture using a reference utterance as a cue. Most TSE systems adopt conditional auto-encoder architectures with one-step inference. Inspired by test-time scaling, we propose a training-free multi-step inference method that enables iterative refinement with a frozen pretrained model. At each step, new candidates are generated by interpolating the original mixture and the previous estimate, and the best candidate is selected for further refinement until convergence. Experiments show that, when ground-truth target speech is available, optimizing an intrusive metric (SI-SDRi) yields consistent gains across multiple evaluation metrics. Without ground truth, optimizing non-intrusive metrics (UTMOS or SpkSim) improves the corresponding metric but may hurt others. We therefore introduce joint metric optimization…
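The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `model(mixture, reference)` call signature, the interpolation weights, the stopping tolerance, and the choice to feed each interpolated signal back through the frozen model are all hypothetical. `si_sdr` is the standard scale-invariant SDR, and `score` is any callable that rates a candidate (for example SI-SDRi when ground truth is available, or a non-intrusive metric otherwise).

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Standard scale-invariant SDR in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to remove the scale ambiguity.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def multi_step_extract(model, mixture, reference, score,
                       weights=(0.0, 0.25, 0.5, 0.75),
                       max_steps=10, tol=1e-3):
    """Iteratively refine a TSE estimate with a frozen model (hypothetical sketch).

    model:   callable (input_signal, reference) -> estimate; never updated here.
    score:   callable (estimate) -> float, higher is better.
    weights: assumed interpolation weights between mixture and previous estimate.
    """
    best = model(mixture, reference)  # ordinary one-step inference
    best_score = score(best)
    for _ in range(max_steps):
        # Candidate inputs interpolate the original mixture and the previous estimate.
        candidates = [model((1.0 - w) * mixture + w * best, reference)
                      for w in weights]
        scores = [score(c) for c in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best_score + tol:
            break  # no candidate improves the score: treat as converged
        best, best_score = candidates[i], scores[i]
    return best, best_score
```

Because the model is only called, never updated, the extra cost is purely inference time: each step runs one forward pass per interpolation weight.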

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing