Device-directed Utterance Detection

Sri Harish Mallidi; Roland Maas; Kyle Goehner; Ariya Rastrow; Spyros Matsoukas; Björn Hoffmeister

arXiv:1808.02504 · cs.CL · August 9, 2018 · 1 citation

Open Access

TL;DR

This paper presents a neural network-based classifier that distinguishes device-directed speech from background speech, helping voice assistants reject false wake-ups and accept follow-up queries without a repeated wake-word.

Contribution

The work introduces a novel combination of LSTM and DNN models trained on acoustic and ASR features for device-directed utterance detection, reaching a final equal error rate (EER) of 5.2%, a 44% relative improvement over the 9.3% EER of ASR decoder features alone.

Findings

01. Combining all features gave a final EER of 5.2%.

02. ASR decoder features alone yielded an EER of 9.3%.

03. Combining acoustic and ASR features gave a 44% relative improvement in EER (from 9.3% to 5.2%).
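
The equal error rate (EER) quoted above is the operating point where the false-accept rate and the false-reject rate are equal. The following is a minimal sketch of how such a figure and the 44% relative improvement could be computed; it is not the authors' evaluation code. The function names and the toy scores are hypothetical, and the convention that a higher score means "more likely device-directed" is an assumption.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a binary detector.

    scores: classifier outputs; higher is assumed to mean device-directed.
    labels: 1 = device-directed, 0 = background speech.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best = None
    for threshold in sorted(set(scores)):
        # An utterance is accepted as device-directed when score >= threshold.
        false_reject = sum(s < threshold for s in positives) / len(positives)
        false_accept = sum(s >= threshold for s in negatives) / len(negatives)
        gap = abs(false_reject - false_accept)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (false_reject + false_accept) / 2, threshold)
    return best[1], best[2]


def relative_improvement(baseline_eer, new_eer):
    """Fractional reduction in EER relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy, made-up scores purely for illustration (not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer, toy_threshold = equal_error_rate(toy_scores, toy_labels)

# EER figures reported in the findings: 9.3% (ASR decoder only), 5.2% (combined).
reported_gain = relative_improvement(9.3, 5.2)
print(f"relative improvement: {reported_gain:.1%}")
```

On real data the two error rates rarely cross exactly at an observed score, so this sketch reports the average of the two rates at the closest threshold; evaluation toolkits often interpolate instead.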

Abstract

In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false wake-ups or unintended interactions as well as enabling wake-word free follow-up queries. Consider the example interaction: "Computer, play music", "Computer, reduce the volume". In this interaction, the user needs to repeat the wake-word (Computer) for the second query. To allow for more natural interactions, the device could immediately re-enter listening state after the first query (without wake-word repetition) and accept or reject a potential follow-up as device-directed or background speech. The proposed model consists of two long short-term memory (LSTM) neural networks trained on acoustic features and automatic speech recognition (ASR) 1-best hypotheses, respectively. A…
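
The interaction flow described in the abstract can be illustrated with a small routing sketch. This is only an illustration under stated assumptions, not the paper's system: `route_utterance`, `directed_score`, and the `0.5` threshold are hypothetical names and values, and the classifier itself is assumed to exist elsewhere and return a score in [0, 1].

```python
WAKE_WORD = "computer"  # example wake-word taken from the abstract


def route_utterance(transcript, in_follow_up_window, directed_score, threshold=0.5):
    """Decide whether the device should act on an utterance.

    transcript: recognized text of the utterance.
    in_follow_up_window: True if the device re-entered listening state
        after a previous query (no wake-word needed).
    directed_score: hypothetical classifier output; higher is assumed to
        mean "more likely device-directed".
    threshold: hypothetical operating point, not a value from the paper.
    """
    starts_with_wake_word = transcript.lower().startswith(WAKE_WORD)
    if not starts_with_wake_word and not in_follow_up_window:
        # The device is not listening at all, so nothing is classified.
        return "ignore"
    # Covers both use cases: rejecting false wake-ups and
    # accepting or rejecting wake-word-free follow-ups.
    return "accept" if directed_score >= threshold else "reject"


# First query uses the wake-word; the follow-up does not.
first = route_utterance("Computer, play music", False, 0.9)
follow_up = route_utterance("reduce the volume", True, 0.8)
chatter = route_utterance("did you see the game", True, 0.1)
```

In this sketch the same score gates both wake-word-initiated and follow-up utterances; a deployed system could reasonably use different thresholds for the two cases.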

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing