TL;DR
The paper introduces the Keyword Transformer (KWT), a fully self-attentional model for keyword spotting that surpasses existing methods without pre-training or additional data, setting new accuracy records on both the 12-command and 35-command Google Speech Commands tasks.
Contribution
The paper presents a simple Transformer-based architecture for keyword spotting that outperforms more complex models and sets new state-of-the-art results without pre-training or additional data.
Findings
KWT achieves 98.6% accuracy on the 12-command Google Speech Commands task.
KWT achieves 97.7% accuracy on the 35-command Google Speech Commands task.
KWT outperforms more complex models that mix convolutional, recurrent, and attentive layers.
Abstract
The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.
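To make "fully self-attentional" concrete, here is a minimal sketch of single-head scaled dot-product self-attention applied directly to a sequence of audio feature frames, with no convolutional or recurrent encoder underneath. It is an illustration only, not the authors' implementation: the frame count `T`, feature size `F`, projection size `D`, the random weights, and the helper names are all assumptions chosen for the example, and a real model would add multiple heads, feed-forward layers, residual connections, and layer normalization.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time frames."""
    q = matmul(frames, w_q)  # queries, shape T x D
    k = matmul(frames, w_k)  # keys,    shape T x D
    v = matmul(frames, w_v)  # values,  shape T x D
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose keys to D x T
    scores = matmul(q, k_t)               # frame-to-frame similarity, T x T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or real audio features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, F, D = 5, 8, 4  # assumed sizes: 5 time frames, 8 features, 4-dim projection
frames = random_matrix(T, F, rng)
w_q, w_k, w_v = (random_matrix(F, D, rng) for _ in range(3))

attended, weights = self_attention(frames, w_q, w_k, w_v)
print(len(attended), len(attended[0]))  # one D-dimensional output per frame
```

Each row of `weights` sums to 1 and describes how much one time frame attends to every other frame, which is what lets the model relate distant parts of an utterance without recurrence or convolution.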
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Residual Connection · Layer Normalization · Label Smoothing · Adam · Multi-Head Attention
