Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

Saurabh Adya; Vineet Garg; Siddharth Sigtia; Pramod Simha; Chandra Dhir

arXiv:2008.02323·eess.AS·August 7, 2020

TL;DR

This paper introduces a hybrid self-attention and CTC-based network architecture for voice trigger detection. Compared with a BiLSTM baseline, it achieves higher accuracy with fewer parameters and shorter inference and training times.

Contribution

The paper proposes a hybrid transformer/CTC network trained with multi-task learning for voice trigger detection; it outperforms the BiLSTM baseline in both accuracy and speed.

Findings

01

60% reduction in false reject rates at the same false alarm rate

02

10% fewer parameters required by the new models

03

70% reduction in inference time on-device

Abstract

We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies the whole sequence to be true-trigger or…
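
The joint objective described in the abstract (a CTC loss on the encoder outputs plus a cross-entropy loss on the decoder outputs) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy inputs, and in particular the interpolation weight `ctc_weight` are assumptions made for illustration, since the abstract does not say how the two losses are combined.

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward recursion.

    log_probs: T frames, each a list of per-symbol log-probabilities
               (the encoder outputs in this sketch).
    target:    label ids without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank in between
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    end = [alpha[size - 1]]
    if size > 1:
        end.append(alpha[size - 2])
    return -logsumexp(*end)

def cross_entropy(step_log_probs, target):
    """Summed token-level cross-entropy for decoder outputs (one row per step)."""
    return -sum(row[y] for row, y in zip(step_log_probs, target))

def joint_loss(ctc, ce, ctc_weight=0.5):
    """Hypothetical weighted sum; the weighting scheme is an assumption."""
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Toy example: 2 frames, vocabulary {0: blank, 1: "a"}, uniform probabilities.
half = math.log(0.5)
encoder_out = [[half, half], [half, half]]
decoder_out = [[half, half]]
target = [1]

loss_ctc = ctc_nll(encoder_out, target)
loss_ce = cross_entropy(decoder_out, target)
loss = joint_loss(loss_ctc, loss_ce)
```

In the toy example three alignments ("a-", "-a", "aa") each have probability 0.25, so the CTC likelihood is 0.75. A real system would compute both losses on batched tensors in an automatic-differentiation framework rather than on Python lists.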

Taxonomy

Methods: Attention Model · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM · Connectionist Temporal Classification Loss