MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow
Riki Shimizu, Xilin Jiang, Nima Mesgarani

TL;DR
MeanFlow-TSE is a one-step generative framework for target speaker extraction (TSE). On Libri2Mix it is reported to outperform multi-step diffusion and flow-matching TSE models in separation quality while needing only a single sampling step, which makes it suitable for low-latency use.
Contribution
A mean-flow-based one-step generative approach to TSE that removes the need for iterative sampling and so targets low-latency, real-time application.
Findings
Outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics
Requires only a single inference step, enabling real-time processing
Demonstrates effectiveness on the Libri2Mix corpus
Abstract
Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically require multi-step sampling, which limits their practicality in low-latency settings. In this work, we propose MeanFlow-TSE, a one-step generative TSE framework trained with mean-flow objectives, enabling fast and high-quality generation without iterative refinement. Building on the AD-FlowTSE paradigm, our method defines a flow between the background and target source that is governed by the mixing ratio (MR). Experiments on the Libri2Mix corpus show that our approach outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics while requiring only a single…
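The difference between one-step mean-flow sampling and multi-step integration can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical velocity field v(z, t) = z, for which the average velocity over an interval [r, t] has a closed form, and all function names are illustrative. It uses the mean-flow identity z_r = z_t - (t - r) * u(z_t, r, t), where u is the average of the instantaneous velocity along the trajectory, and compares it with Euler integration of the instantaneous velocity.

```python
import numpy as np

def instantaneous_velocity(z, t):
    # Toy field (assumption): dz/dt = z, so z_t = z_0 * exp(t).
    return z

def average_velocity(z_t, r, t):
    # Closed-form average of v over [r, t] along the trajectory through z_t.
    # In a learned model this would be a network output; here it is an oracle.
    dt = t - r
    if dt == 0.0:
        return instantaneous_velocity(z_t, t)
    return z_t * (1.0 - np.exp(-dt)) / dt

def one_step_sample(z_t, r, t):
    # Mean-flow update: a single jump from time t back to time r.
    return z_t - (t - r) * average_velocity(z_t, r, t)

def euler_sample(z_t, r, t, n_steps):
    # Baseline: integrate the instantaneous velocity backwards in n_steps.
    z = np.array(z_t, dtype=float)
    h = (t - r) / n_steps
    tau = t
    for _ in range(n_steps):
        z = z - h * instantaneous_velocity(z, tau)
        tau -= h
    return z

z0 = np.array([0.5, -1.0, 2.0])   # stand-in for the "clean" endpoint
z1 = z0 * np.exp(1.0)             # the same trajectory at t = 1

one_step = one_step_sample(z1, 0.0, 1.0)
euler_1 = euler_sample(z1, 0.0, 1.0, 1)
euler_100 = euler_sample(z1, 0.0, 1.0, 100)

print("one-step error: ", np.abs(one_step - z0).max())
print("Euler x1 error: ", np.abs(euler_1 - z0).max())
print("Euler x100 error:", np.abs(euler_100 - z0).max())
```

With an exact average velocity, the single jump lands on the endpoint, whereas Euler integration needs many steps to approach it. In practice the average velocity is approximated by a trained network, so the one-step result is only as good as that approximation.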
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
