Listen to Extract: Onset-Prompted Target Speaker Extraction

Pengjie Shen; Kangrui Chen; Shulin He; Pengru Chen; Shuqi Yuan; He Kong; Xueliang Zhang; Zhong-Qiu Wang

arXiv:2505.05114 · eess.AS · November 6, 2025

Open Access

TL;DR

This paper introduces LExt, a simple yet effective monaural target speaker extraction method that concatenates an enrollment utterance to the mixture to create an artificial onset, guiding neural networks to accurately extract the target speaker.

Contribution

The paper presents a straightforward approach to target speaker extraction: the enrollment utterance is concatenated to the mixture at the waveform level, and a DNN is trained to extract the target speech from the concatenated signal.

Findings

01 Achieves strong performance on multiple public TSE datasets

02 Outperforms existing methods in target speaker extraction tasks

03 Demonstrates simplicity and effectiveness of the approach

Abstract

We propose listen to extract (LExt), a highly effective yet extremely simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims to extract the target speaker's speech from a mixture with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at the waveform level, and trains deep neural networks (DNNs) to extract the target speech from the concatenated signal. The rationale is that this creates an artificial speech onset for the target speaker, which could prompt the DNN with (a) which speaker is the target to extract; and (b) spectral-temporal patterns of the target speaker that could help extraction. This simple approach produces strong TSE performance on multiple public TSE datasets, including WSJ0-2mix, WHAM! and WHAMR!.
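The waveform-level concatenation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `concat_enrollment` and `crop_to_mixture`, the choice to place the enrollment before the mixture, and the optional silence gap (`gap_samples`) are all assumptions made for illustration, and the extraction DNN is replaced by an identity stand-in.

```python
import numpy as np


def concat_enrollment(enrollment, mixture, gap_samples=0):
    """Prepend an enrollment waveform to a mixture waveform.

    Assumption: the enrollment goes first, so the target speaker's
    onset precedes the mixture. `gap_samples` (hypothetical) inserts
    silence between the two segments.
    """
    enrollment = np.asarray(enrollment, dtype=np.float32)
    mixture = np.asarray(mixture, dtype=np.float32)
    if enrollment.ndim != 1 or mixture.ndim != 1:
        raise ValueError("expected 1-D monaural waveforms")
    gap = np.zeros(gap_samples, dtype=np.float32)
    prompted = np.concatenate([enrollment, gap, mixture])
    # Index at which the mixture starts inside the concatenated signal.
    offset = enrollment.shape[0] + gap_samples
    return prompted, offset


def crop_to_mixture(estimate, offset, mixture_len):
    """Keep only the part of a model output aligned with the mixture."""
    return estimate[offset:offset + mixture_len]


# Example with random signals standing in for real audio at 16 kHz.
rng = np.random.default_rng(0)
enrollment = rng.standard_normal(16000).astype(np.float32)  # 1 s
mixture = rng.standard_normal(32000).astype(np.float32)     # 2 s

prompted, offset = concat_enrollment(enrollment, mixture, gap_samples=800)

# Identity stand-in for the extraction DNN (an assumption of this sketch).
estimate = prompted
target_region = crop_to_mixture(estimate, offset, mixture.shape[0])
```

In this sketch the model sees the whole concatenated signal, while only the segment aligned with the original mixture is kept as the extracted output. Whether the loss is computed on that segment alone is not stated in the abstract.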

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing