Dynamic Speech Endpoint Detection with Regression Targets

Dawei Liang; Hang Su; Tarun Singh; Jay Mahadeokar; Shanil Puri; Jiedan Zhu; Edison Thomaz; Mike Seltzer

arXiv:2210.14252 · cs.SD · October 27, 2022

Open Access

TL;DR

This paper introduces a regression-based approach for speech end-point detection in voice assistants, allowing dynamic adjustment based on user query context, leading to improved latency and accuracy trade-offs.

Contribution

The paper presents a novel regression model for speech end-point detection that adapts to query context, enhancing performance over traditional classification methods.

Findings

01 Regression-based end-pointing gives a better trade-off between latency and accuracy than classification-based end-pointing.

02 Pause modeling enhances dynamic detection capabilities.

03 The approach generalizes well across devices.

Abstract

Interactive voice assistants have been widely used as input interfaces in various scenarios, e.g., on smart home devices, wearables, and AR devices. Detecting the end of a speech query, i.e., speech end-pointing, is an important task for voice assistants to interact with users. Traditionally, speech end-pointing is based on pure classification methods along with arbitrary binary targets. In this paper, we propose a novel regression-based speech end-pointing model, which enables an end-pointer to adjust its detection behavior based on the context of user queries. Specifically, we present a pause modeling method and show its effectiveness for dynamic end-pointing. Based on our experiments with vendor-collected smartphone and wearables speech queries, our strategy shows a better trade-off between end-pointing latency and accuracy, compared to the traditional classification-based method. We…
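The contrast between binary classification targets and regression targets can be made concrete with a toy example. The sketch below is illustrative only and is not the paper's pause modeling method: the linear ramp, the function names (binary_targets, ramp_targets, first_trigger), and the ramp_frames parameter are all assumptions introduced here to show how a graded target lets one threshold trade latency against the risk of cutting a user off.

```python
from typing import List


def binary_targets(num_frames: int, end_frame: int) -> List[float]:
    """Classification-style targets: 0 up to the last speech frame, 1 afterwards."""
    return [0.0 if t <= end_frame else 1.0 for t in range(num_frames)]


def ramp_targets(num_frames: int, end_frame: int, ramp_frames: int) -> List[float]:
    """Hypothetical regression-style targets (an assumption, not the paper's).

    The target rises linearly from 0 to 1 over `ramp_frames` frames of
    silence following `end_frame`, then stays at 1.
    """
    targets = []
    for t in range(num_frames):
        elapsed = t - end_frame  # frames of silence since the last speech frame
        targets.append(min(1.0, max(0.0, elapsed / ramp_frames)))
    return targets


def first_trigger(scores: List[float], threshold: float) -> int:
    """Return the first frame whose score reaches the threshold, or -1 if none does."""
    for t, score in enumerate(scores):
        if score >= threshold:
            return t
    return -1


# Toy utterance: 8 frames, last speech frame at index 2.
hard = binary_targets(8, 2)
soft = ramp_targets(8, 2, ramp_frames=4)

# With binary targets every threshold in (0, 1] fires at the same frame;
# with the ramp, a higher threshold waits through a longer pause.
print(first_trigger(hard, 0.5), first_trigger(soft, 0.5), first_trigger(soft, 0.75))
```

In this toy setup the binary targets offer only one operating point, while the ramp exposes several: raising the threshold delays the end-point decision, which is the kind of latency versus accuracy adjustment the abstract describes at a high level.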

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling