MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow

Riki Shimizu; Xilin Jiang; Nima Mesgarani

arXiv:2512.18572 · eess.AS · December 23, 2025

TL;DR

MeanFlow-TSE is a one-step generative framework for target speaker extraction. On Libri2Mix it outperforms multi-step diffusion and flow-matching models in separation quality and perceptual metrics while needing only a single sampling step, which makes it suitable for low-latency use.

Contribution

The paper presents a one-step generative approach to TSE trained with mean-flow objectives, which removes iterative sampling and makes low-latency, real-time use practical.

Findings

01 Outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics

02 Requires only a single inference step, enabling real-time processing

03 Results are reported on the Libri2Mix corpus

Abstract

Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically require multi-step sampling, which limits their practicality in low-latency settings. In this work, we propose MeanFlow-TSE, a one-step generative TSE framework trained with mean-flow objectives, enabling fast and high-quality generation without iterative refinement. Building on the AD-FlowTSE paradigm, our method defines a flow between the background and target source that is governed by the mixing ratio (MR). Experiments on the Libri2Mix corpus show that our approach outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics while requiring only a single…
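
To make the one-step idea in the abstract concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a straight-line flow with the target source at time 0 and the background at time 1, and a mixture sitting at an arbitrary position t = 0.4 standing in for the mixing ratio. The function names (`interpolate`, `oracle_mean_velocity`, `one_step_jump`) are hypothetical, and the oracle velocity replaces the trained network that a real system would use. The update follows the general mean-flow relation that the displacement over an interval equals the interval length times the average velocity.

```python
import random


def interpolate(a, b, t):
    """Point at time t on the straight path from a (t=0) to b (t=1)."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]


def oracle_mean_velocity(a, b):
    """Average velocity of the straight path from a to b.

    Assumption: a trained network would predict this quantity from the
    mixture and a speaker reference; the toy computes it from the known
    endpoints instead.
    """
    return [bi - ai for ai, bi in zip(a, b)]


def one_step_jump(z_t, u, t, r):
    """Move from time t to time r in one step: z_r = z_t - (t - r) * u."""
    return [zi - (t - r) * ui for zi, ui in zip(z_t, u)]


random.seed(0)
num_samples = 1000  # toy signal length, chosen arbitrarily
target = [random.gauss(0.0, 1.0) for _ in range(num_samples)]
background = [random.gauss(0.0, 1.0) for _ in range(num_samples)]

t_mix = 0.4  # assumed position of the mixture on the path
mixture = interpolate(target, background, t_mix)

# A single evaluation of the average velocity, then one jump to t = 0.
u = oracle_mean_velocity(target, background)
estimate = one_step_jump(mixture, u, t=t_mix, r=0.0)

max_error = max(abs(e - s) for e, s in zip(estimate, target))
print(f"max reconstruction error: {max_error:.2e}")
```

With the oracle velocity the jump recovers the target up to floating-point error. In a learned system the quality of the single step depends entirely on how well the network predicts the average velocity, which is what the mean-flow training objective targets.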

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing