GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

Haoyang Li; Xuyi Zhuang; Azmat Adnan; Ye Ni; Wei Rao; Shreyas Gopal; Eng Siong Chng

arXiv:2512.20978·eess.AS·December 25, 2025

Open Access

TL;DR

GenTSE introduces a two-stage generative language model for target speaker extraction (TSE). It separates semantic and acoustic modeling, trains with Frozen-LM Conditioning and DPO, and reports improved speech quality, intelligibility, and speaker consistency.

Contribution

The paper proposes a two-stage decoder-only generative LM approach for TSE that separates semantics and acoustics, and introduces training strategies (Frozen-LM Conditioning and DPO) to enhance decoding stability and output quality.
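
The coarse-to-fine idea can be illustrated with a toy sketch. Everything here is hypothetical: the names `autoregress`, `coarse_to_fine`, `toy_stage1`, `toy_stage2`, and the `EOS` id are invented for illustration, the "models" are deterministic stand-ins rather than trained LMs, and feeding Stage-1 tokens to Stage-2 as a prefix is an assumption, since the listing does not say how the stages are wired together.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# A "model" here is any function mapping (continuous context, token history) -> next token.
NextTokenFn = Callable[[Sequence[float], Sequence[Token]], Token]

EOS = 0  # hypothetical end-of-sequence token id


def autoregress(next_token: NextTokenFn, context: Sequence[float],
                prefix: Sequence[Token], max_len: int) -> List[Token]:
    """Greedy autoregressive decoding: append tokens until EOS or max_len."""
    out: List[Token] = []
    for _ in range(max_len):
        tok = next_token(context, list(prefix) + out)
        if tok == EOS:
            break
        out.append(tok)
    return out


def coarse_to_fine(stage1: NextTokenFn, stage2: NextTokenFn,
                   context: Sequence[float],
                   max_len: int = 8) -> Tuple[List[Token], List[Token]]:
    """Stage 1 decodes coarse semantic tokens; Stage 2 decodes fine acoustic
    tokens with the semantic tokens as a prefix (an assumed wiring)."""
    semantic = autoregress(stage1, context, [], max_len)
    acoustic = autoregress(stage2, context, semantic, max_len)
    return semantic, acoustic


def toy_stage1(context: Sequence[float], history: Sequence[Token]) -> Token:
    # Toy stand-in: emit 1, 2, 3 and then stop.
    return len(history) + 1 if len(history) < 3 else EOS


def toy_stage2(context: Sequence[float], history: Sequence[Token]) -> Token:
    # Toy stand-in: one "acoustic" token (id >= 100) per semantic token.
    semantic = [t for t in history if t < 100]
    n_out = len(history) - len(semantic)
    return 100 + semantic[n_out] if n_out < len(semantic) else EOS


semantic_tokens, acoustic_tokens = coarse_to_fine(toy_stage1, toy_stage2, context=[0.0])
```

In a real system the two callables would be decoder-only LMs conditioned on continuous embeddings of the mixture and the enrollment speech; the sketch only shows the control flow of decoding one stage and conditioning the next on its output.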

Findings

01. Outperforms previous LM-based TSE systems in speech quality.

02. Achieves higher intelligibility and speaker consistency.

03. Demonstrates effectiveness on the Libri2Mix dataset.

Abstract

Language model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous self-supervised learning (SSL) or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further employ Direct Preference Optimization (DPO) to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE…
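
For readers unfamiliar with DPO, the standard per-pair objective is the negative log-sigmoid of a scaled log-probability margin between a preferred and a rejected output, measured relative to a frozen reference model. The sketch below computes that loss for a single pair in plain Python. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions; the abstract does not say how GenTSE builds its preference pairs or which hyperparameters it uses.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed token log-probability of a full output
    sequence under the trainable policy or the frozen reference model.
    """
    # How much more the policy favors the chosen output than the reference does,
    # minus the same quantity for the rejected output.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen output's likelihood relative to the rejected one lowers the loss.
aligned = dpo_loss(-1.0, -9.0, -5.0, -5.0)
```

In a TSE setting the "chosen" and "rejected" sequences would presumably be two candidate token sequences for the same mixture, ranked by some perceptual criterion; that pairing scheme is an assumption here, not a detail taken from the paper.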

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Face Recognition and Analysis