Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models
Bang Zeng, Beilong Tang, Wang Xiang, Ming Li

TL;DR
This paper introduces a novel discriminative-generative two-stage framework for target speaker extraction that combines interference suppression with high-quality speech reconstruction, improving perceptual quality and naturalness.
Contribution
The paper proposes a two-stage framework that integrates discriminative and generative models for TSE, addressing the limited perceptual quality of discriminative systems and the hallucination and limited controllability of purely generative ones.
Findings
Achieves a better balance among perceptual quality, intelligibility, and speaker consistency.
Outperforms purely discriminative and purely generative baselines on TSE and SE benchmarks.
Demonstrates the effectiveness of collaboration strategies such as joint fine-tuning and regularization.
Abstract
Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with…
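The two-stage structure described in the abstract can be pictured as a simple composition of two callables. The sketch below is only an illustration of that data flow: the class name `TwoStageTSE`, the function signatures, and both toy stages are hypothetical stand-ins invented here, not the paper's models or interfaces. In particular, the fixed-gain front-end and pass-through back-end are placeholders for trained networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Signal = List[float]
Stage = Callable[[Sequence[float], Sequence[float]], Signal]


@dataclass
class TwoStageTSE:
    """Hypothetical wrapper: discriminative front-end, then generative back-end."""
    front_end: Stage
    back_end: Stage

    def extract(self, mixture: Sequence[float], enrollment: Sequence[float]) -> Signal:
        # Stage 1: produce a target-related intermediate from the mixture.
        intermediate = self.front_end(mixture, enrollment)
        # Stage 2: reconstruct the output conditioned on that intermediate.
        return self.back_end(intermediate, enrollment)


def toy_front_end(mixture: Sequence[float], enrollment: Sequence[float]) -> Signal:
    # Placeholder (assumption): attenuate with a fixed gain.
    # A real front-end would be a trained discriminative network.
    return [0.5 * x for x in mixture]


def toy_back_end(intermediate: Sequence[float], enrollment: Sequence[float]) -> Signal:
    # Placeholder (assumption): pass the intermediate through unchanged.
    # A real back-end would be a generative model such as a decoder-only LM.
    return list(intermediate)


pipeline = TwoStageTSE(front_end=toy_front_end, back_end=toy_back_end)
output = pipeline.extract(mixture=[0.2, -0.4, 0.6], enrollment=[0.1, 0.1])
```

Keeping the two stages behind the same call signature makes it easy to swap either one for a baseline, which is one way to compare a purely discriminative or purely generative system against the combined pipeline.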
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
