Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

Yiming Yang; Guangyong Wang; Haixin Guan; Yanhua Long

arXiv:2602.15519·eess.AS·February 25, 2026

Open Access

TL;DR

This paper introduces Enroll-on-Wakeup, a novel target speech extraction framework that uses naturally captured wake-word segments as enrollment references, enabling seamless interaction without pre-recorded speech, and evaluates its performance with various models and augmentation techniques.

Contribution

The paper proposes the Enroll-on-Wakeup (EoW) framework, which uses wake-word segments as the enrollment reference for target speech extraction (TSE), eliminating the need for pre-collected enrollment speech, and systematically evaluates it in real noisy environments.

Findings

01. TSE models show performance degradation in EoW scenarios.

02. TTS-based enrollment augmentation improves listening experience.

03. Gaps remain in speech recognition accuracy despite enhancements.

Abstract

Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps…
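
The core idea in the abstract, reusing the wake-word segment as the enrollment reference, can be illustrated with a minimal sketch. It only shows how one recording might be split into an enrollment segment and the speech to be processed. The function name `split_on_wakeup`, the assumption that wake-word boundaries arrive as timestamps (for example from a keyword spotter), and the choice to treat everything after the wake word as the query are all hypothetical and not taken from the paper. No TSE model is run here.

```python
def split_on_wakeup(audio, sample_rate, wake_start_s, wake_end_s):
    """Split one recording into (enrollment, query) at the wake-word boundaries.

    Hypothetical helper: the boundaries are assumed to come from an
    external wake-word detector, which is not modelled here.
    """
    start = int(round(wake_start_s * sample_rate))
    end = int(round(wake_end_s * sample_rate))
    if not 0 <= start < end <= len(audio):
        raise ValueError("wake-word boundaries fall outside the recording")
    enrollment = audio[start:end]  # wake-word segment, reused as the speaker reference
    query = audio[end:]            # assumed: the speech that follows the wake word
    return enrollment, query


# Usage with a 3-second placeholder signal sampled at 16 kHz.
sample_rate = 16000
audio = [0.0] * (3 * sample_rate)
enrollment, query = split_on_wakeup(audio, sample_rate, 0.5, 1.3)
```

In this sketch the enrollment segment is under a second long, which matches the abstract's point that wake-word segments are short and may need augmentation before they serve as a useful reference.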

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems