Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios
Yiming Yang, Guangyong Wang, Haixin Guan, Yanhua Long

TL;DR
This paper introduces Enroll-on-Wakeup (EoW), a target speech extraction (TSE) framework that uses the naturally captured wake-word segment as the enrollment reference, so no pre-recorded enrollment speech is needed. It evaluates discriminative and generative TSE models in this setting, together with TTS-based enrollment augmentation.
Contribution
The paper proposes the Enroll-on-Wakeup (EoW) framework, which uses wake-word segments as the enrollment reference for target speech extraction (TSE) and so removes the need for pre-collected enrollment speech. It also presents the first systematic evaluation of this setting in real noisy environments.
Findings
Current TSE models degrade in the EoW setting, where the enrollment is a short, noisy wake-word segment.
TTS-based enrollment augmentation significantly improves the listening experience.
Gaps remain in speech recognition accuracy despite enhancements.
Abstract
Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps…
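The core idea in the abstract, reusing the wake-word segment as the enrollment reference, can be illustrated with a minimal sketch. It assumes a keyword spotter has already reported the wake-word boundaries; `WakeWordSpan` and `split_enrollment_and_query` are hypothetical names introduced here for illustration, not the authors' code, and the TSE model that would consume the two segments is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WakeWordSpan:
    """Hypothetical keyword-spotter output: wake-word boundaries in seconds."""
    start_s: float
    end_s: float


def split_enrollment_and_query(
    samples: List[float], sample_rate: int, span: WakeWordSpan
) -> Tuple[List[float], List[float]]:
    """Return (enrollment, query): the wake-word segment and the audio after it."""
    n = len(samples)
    # Convert boundaries to sample indices and clamp them to the recording.
    start = max(0, min(n, round(span.start_s * sample_rate)))
    end = max(start, min(n, round(span.end_s * sample_rate)))
    if end == start:
        raise ValueError("wake-word span is empty")
    # The wake word serves as enrollment; everything after it is the
    # (possibly noisy) mixture from which the target speech is extracted.
    return samples[start:end], samples[end:]


# Example: 3 s of placeholder audio at 16 kHz, wake word assumed at 0.2-1.0 s.
sr = 16000
audio = [0.0] * (3 * sr)
enrollment, query = split_enrollment_and_query(audio, sr, WakeWordSpan(0.2, 1.0))
print(len(enrollment) / sr, len(query) / sr)
```

In this sketch the enrollment is well under a second long, which reflects why the abstract treats short, noisy wake-word segments as the main difficulty and investigates TTS-based augmentation of the enrollment.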
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
