Streaming on-device detection of device directed speech from voice and touch-based invocation
Ognjen Rudovic, Akanksha Bindal, Vineet Garg, Pramod Simha, Pranay Dighe, Sachin Kajarekar

TL;DR
This paper introduces a novel streaming on-device detection method for device-directed speech that handles both voice and touch invocations, improving accuracy and efficiency for virtual assistant activation.
Contribution
The paper presents the first streaming approach capable of detecting device-directed speech from multiple invocation types using a new TCN-based decision layer for on-device deployment.
Findings
Streaming TCN outperforms alternatives in accuracy and speed.
All models show minimal accuracy degradation compared to invocation-specific models.
The approach reduces runtime peak-memory by up to 33% compared to LSTM-based methods.
Abstract
When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can be invoked accidentally by keyword-like speech or an unintended button press, which may have implications for user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles both voice-trigger and touch-based invocations. To facilitate on-device model deployment, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN) [1], known for their computational efficiency. To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a…
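The abstract names a streaming decision layer derived from temporal convolutional networks but does not give its details here. As a rough illustration of the general idea only, the sketch below shows how a single causal dilated convolution can be evaluated one frame at a time using a fixed-size buffer, and how that matches an offline computation with left zero-padding. It is not the paper's architecture: the class `StreamingCausalConv`, the helper `offline_causal_conv`, the scalar (single-channel) frames, and the tap values are all hypothetical choices made for this example.

```python
from collections import deque


class StreamingCausalConv:
    """Hypothetical single-channel causal dilated convolution, one frame per call."""

    def __init__(self, taps, dilation):
        self.taps = list(taps)      # taps[-1] multiplies the current frame
        self.dilation = dilation
        # The layer only ever needs the last (k - 1) * dilation + 1 frames.
        span = (len(self.taps) - 1) * dilation + 1
        # Zero-filled start is equivalent to left zero-padding offline.
        self.buffer = deque([0.0] * span, maxlen=span)

    def step(self, x):
        self.buffer.append(x)       # the oldest frame is dropped automatically
        # Every `dilation`-th buffered frame, ending at the current one.
        window = list(self.buffer)[::self.dilation]
        return sum(w * v for w, v in zip(self.taps, window))


def offline_causal_conv(signal, taps, dilation):
    """Reference: the same convolution computed over the whole signal at once."""
    pad = (len(taps) - 1) * dilation
    padded = [0.0] * pad + list(signal)
    return [
        sum(w * padded[t + i * dilation] for i, w in enumerate(taps))
        for t in range(len(signal))
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    layer = StreamingCausalConv(taps=[0.5, 1.0], dilation=2)
    streamed = [layer.step(x) for x in signal]
    # Streaming and offline evaluation should agree frame by frame.
    assert streamed == offline_causal_conv(signal, [0.5, 1.0], 2)
    print(streamed)
```

In this sketch the state carried between frames is the fixed-size input buffer, whose length depends only on the kernel size and dilation, not on the utterance length. Whether the paper's decision layer manages its state this way is an assumption, not something the abstract confirms.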
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
