Dynamic Speech Endpoint Detection with Regression Targets
Dawei Liang, Hang Su, Tarun Singh, Jay Mahadeokar, Shanil Puri, Jiedan Zhu, Edison Thomaz, Mike Seltzer

TL;DR
This paper introduces a regression-based approach to speech end-pointing for voice assistants. The end-pointer adjusts its detection behavior to the context of the user query, which yields a better trade-off between end-pointing latency and accuracy.
Contribution
The paper presents a novel regression model for speech end-point detection that adapts to query context, enhancing performance over traditional classification methods.
Findings
Regression-based end-pointing improves latency and accuracy.
Pause modeling enhances dynamic detection capabilities.
The approach generalizes well across devices.
Abstract
Interactive voice assistants have been widely used as input interfaces in various scenarios, e.g., on smart home devices, wearables, and AR devices. Detecting the end of a speech query, i.e., speech end-pointing, is an important task for voice assistants to interact with users. Traditionally, speech end-pointing is based on pure classification methods along with arbitrary binary targets. In this paper, we propose a novel regression-based speech end-pointing model, which enables an end-pointer to adjust its detection behavior based on the context of user queries. Specifically, we present a pause modeling method and show its effectiveness for dynamic end-pointing. Based on our experiments with vendor-collected smartphone and wearables speech queries, our strategy shows a better trade-off between end-pointing latency and accuracy, compared to the traditional classification-based method. We…
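To make the contrast between binary and regression targets concrete, here is a minimal sketch in Python. It is an illustration only and not the paper's method: the abstract does not specify how the regression targets or the pause model are defined. The function names, the `ramp_frames` parameter, and the linear ramp over silent frames are all assumptions chosen for the example.

```python
from typing import List


def binary_targets(is_speech: List[bool]) -> List[float]:
    """Hard labels: 0 up to and including the last speech frame, 1 afterwards."""
    last = max((i for i, s in enumerate(is_speech) if s), default=-1)
    return [0.0 if i <= last else 1.0 for i in range(len(is_speech))]


def regression_targets(is_speech: List[bool], ramp_frames: int = 4) -> List[float]:
    """Soft labels (hypothetical scheme, not taken from the paper).

    Each silent frame after the first speech frame gets
    min(pause_length / ramp_frames, 1.0), where pause_length counts the
    consecutive silent frames so far. Short mid-query pauses therefore
    stay well below 1, while a long trailing silence ramps up to 1.
    """
    targets = []
    pause_length = 0
    seen_speech = False
    for s in is_speech:
        if s:
            seen_speech = True
            pause_length = 0
        elif seen_speech:
            pause_length += 1
        # Leading silence keeps pause_length at 0, so its target is 0.
        targets.append(min(pause_length / ramp_frames, 1.0))
    return targets


# Toy query: leading silence, speech, a short pause, speech, trailing silence.
frames = [False, True, True, False, False, True, False, False, False, False]
hard = binary_targets(frames)
soft = regression_targets(frames, ramp_frames=4)
```

In this toy scheme the hard labels jump from 0 to 1 right after the last speech frame, whereas the soft labels grow with pause duration, so a model trained on them could in principle be thresholded differently depending on how confident it should be before closing the microphone.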
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling
