Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

Vineet Garg; Wonil Chang; Siddharth Sigtia; Saurabh Adya; Pramod Simha; Pranay Dighe; Chandra Dhir

arXiv:2105.06598·eess.AS·May 17, 2021

TL;DR

This paper introduces a streaming transformer encoder that performs both voice trigger detection (VTD) and false trigger mitigation (FTM) on device from acoustic features alone, lowering the false reject rate, suppressing most false triggers, and cutting runtime memory and inference time.

Contribution

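It proposes a single streaming transformer encoder that processes incoming audio chunk by chunk, retains audio context across chunks, and handles both voice trigger detection and false trigger mitigation. This avoids the automatic speech recognition lattices that traditional false trigger mitigation systems rely on, which are expensive to compute on device.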

Findings

01

18% average relative reduction in false reject rate (FRR) for voice trigger detection at a given false alarm rate

02

95% false trigger suppression with post-trigger audio

03

32% reduction in runtime memory and 56% faster inference
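
The percentages above are relative reductions, so they compare a new value against a baseline rather than subtracting percentage points. A minimal sketch of that arithmetic follows; the function name and the example rates (5.0% and 4.1%) are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`.

    Both arguments are rates in the same unit (for example, FRR in percent).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical rates for illustration only: a baseline FRR of 5.0%
    # and an improved FRR of 4.1% at the same false alarm rate.
    reduction = relative_reduction(5.0, 4.1)
    print(f"relative FRR reduction: {reduction:.0%}")
```

With these made-up rates the absolute drop is 0.9 percentage points, while the relative reduction is roughly 18%.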

Abstract

We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices which are computationally expensive to obtain on device. We propose a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features. The proposed joint model yields an average 18% relative reduction in false reject rate (FRR) for the VTD task at a given false alarm rate. Moreover, our model suppresses 95% of the false triggers with an…
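
The abstract's idea of progressively processing audio chunks while maintaining context can be illustrated with a small sketch. This is not the authors' architecture: it assumes a single attention head, random untrained weights, and a fixed-size cache of past keys and values. The names `StreamingSelfAttention`, `max_context`, and `run_stream` are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StreamingSelfAttention:
    """Single-head self-attention that consumes features chunk by chunk.

    Illustrative assumption: frames in a chunk attend to every frame of the
    same chunk plus up to `max_context` cached frames from earlier chunks.
    """

    def __init__(self, dim: int, max_context: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Random projections stand in for trained weights.
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.dim = dim
        self.max_context = max_context
        self.k_cache = np.zeros((0, dim))
        self.v_cache = np.zeros((0, dim))

    def step(self, chunk: np.ndarray) -> np.ndarray:
        """Process one chunk of shape (frames, dim) and update the cache."""
        q = chunk @ self.wq
        k = np.concatenate([self.k_cache, chunk @ self.wk])
        v = np.concatenate([self.v_cache, chunk @ self.wv])
        weights = softmax(q @ k.T / np.sqrt(self.dim))
        out = weights @ v
        # Keep only the most recent frames so memory stays bounded.
        start = max(0, k.shape[0] - self.max_context)
        self.k_cache = k[start:]
        self.v_cache = v[start:]
        return out


def run_stream(features: np.ndarray, chunk_size: int,
               attn: StreamingSelfAttention) -> np.ndarray:
    """Feed `features` to `attn` in consecutive chunks and stack the outputs."""
    outputs = [attn.step(features[i:i + chunk_size])
               for i in range(0, features.shape[0], chunk_size)]
    return np.concatenate(outputs)


if __name__ == "__main__":
    # Hypothetical input: 50 frames of 8-dimensional acoustic features.
    feats = np.random.default_rng(1).normal(size=(50, 8))
    layer = StreamingSelfAttention(dim=8, max_context=20)
    print(run_stream(feats, chunk_size=10, attn=layer).shape)
```

Bounding the cache is one simple way to keep per-chunk cost and memory constant as audio keeps arriving; the paper may manage context differently.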
