Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Bang Zeng; Beilong Tang; Wang Xiang; Ming Li

arXiv:2601.06006 · eess.AS · May 21, 2026

TL;DR

This paper introduces a novel discriminative-generative two-stage framework for target speaker extraction that combines interference suppression with high-quality speech reconstruction, improving perceptual quality and naturalness.

Contribution

The paper proposes a new two-stage framework integrating discriminative and generative models for TSE, addressing limitations of existing systems in speech quality and controllability.

Findings

01. Achieves better balance among perceptual quality, intelligibility, and speaker consistency.

02. Outperforms purely discriminative or generative baselines on TSE and SE benchmarks.

03. Demonstrates effectiveness of collaboration strategies like joint fine-tuning and regularization.

Abstract

Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing