Streaming on-device detection of device directed speech from voice and touch-based invocation

Ognjen Rudovic; Akanksha Bindal; Vineet Garg; Pramod Simha; Pranay Dighe; Sachin Kajarekar

arXiv:2110.04656 · cs.SD · October 12, 2021

TL;DR

This paper introduces a novel streaming on-device detection method for device-directed speech that handles both voice and touch invocations, improving accuracy and efficiency for virtual assistant activation.

Contribution

The paper presents the first streaming approach capable of detecting device-directed speech from multiple invocation types using a new TCN-based decision layer for on-device deployment.
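As a rough illustration of why a TCN-style layer suits streaming, the sketch below evaluates a single-channel causal dilated convolution one frame at a time, keeping only a fixed-size buffer of past inputs. It is not the paper's decision layer: the class `StreamingCausalConv1D`, the helper `offline_causal_conv`, and the example weights are hypothetical names and values chosen for illustration, and a real TCN would stack several such layers with multiple channels and nonlinearities.

```python
from collections import deque


class StreamingCausalConv1D:
    """Hypothetical single-channel causal dilated convolution, one frame per call."""

    def __init__(self, weights, dilation=1):
        self.weights = list(weights)  # weights[0] applies to the oldest tap
        self.dilation = dilation
        # Past frames needed in addition to the current one.
        history = (len(self.weights) - 1) * dilation
        # Zero-filled history acts as left padding; maxlen drops the oldest frame.
        self.buffer = deque([0.0] * history, maxlen=history + 1)

    def step(self, x):
        """Consume one input frame and return one output value."""
        self.buffer.append(x)
        acc = 0.0
        for i, w in enumerate(self.weights):
            acc += w * self.buffer[i * self.dilation]
        return acc


def offline_causal_conv(xs, weights, dilation=1):
    """Reference: the same convolution computed over the whole sequence at once."""
    k = len(weights)
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for i in range(k):
            idx = t - (k - 1 - i) * dilation
            if idx >= 0:  # indices before the start count as zero padding
                acc += weights[i] * xs[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0]
    layer = StreamingCausalConv1D([0.5, -1.0, 2.0], dilation=2)
    streamed = [layer.step(x) for x in frames]
    # Frame-by-frame results match the whole-sequence computation.
    print(streamed == offline_causal_conv(frames, [0.5, -1.0, 2.0], dilation=2))
```

In this sketch the state held between frames is bounded by the kernel size and dilation, not by the length of the utterance, and each output depends only on current and past frames.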

Findings

01

Streaming TCN outperforms alternatives in accuracy and speed.

02

All models show minimal accuracy degradation compared to invocation-specific models.

03

The approach reduces runtime peak-memory by up to 33% compared to LSTM-based methods.

Abstract

When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can be invoked unintentionally by keyword-like speech or an accidental button press, which may have implications for user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles voice-trigger and touch-based invocations. To facilitate on-device deployment of the model, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN) [1], known for their computational efficiency. To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing