Keyword Transformer: A Self-Attention Model for Keyword Spotting

Axel Berg; Mark O'Connor; Miguel Tairum Cruz

arXiv:2104.00769·eess.AS·April 11, 2022

TL;DR

The paper introduces the Keyword Transformer (KWT), a fully self-attentional model for keyword spotting that exceeds prior state-of-the-art results without pre-training or additional data, reaching 98.6% and 97.7% accuracy on the 12- and 35-command tasks of Google Speech Commands.

Contribution

The paper presents a simple Transformer-based architecture for keyword spotting that outperforms more complex models mixing convolutional, recurrent and attentive layers, and sets new state-of-the-art results without pre-training or additional data.

Findings

01 KWT achieves 98.6% accuracy on the 12-command task of Google Speech Commands.

02 KWT achieves 97.7% accuracy on the 35-command task of Google Speech Commands.

03 KWT outperforms more complex models that mix convolutional, recurrent, and attentive layers.

Abstract

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.
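
To make "fully self-attentional" concrete, the following is a minimal NumPy sketch of a classifier that applies only a linear projection, position embeddings, and one self-attention encoder block to spectrogram frames, with no convolutional or recurrent layers. It is an illustration under stated assumptions, not the authors' implementation: the input size (98 frames of 40 features), the model width, the single block, the class token, the post-normalization ordering, the ReLU activation, and the random untrained weights are all hypothetical choices. Dropout is omitted, as it would be at inference time.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector; learnable scale and bias are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    n, d = x.shape
    dh = d // num_heads

    def split(t):  # (n, d) -> (heads, n, dh)
        return t.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)    # concatenate heads
    return merged @ wo

def encoder_block(x, p, num_heads):
    # Attention and feed-forward sub-layers, each with a residual connection
    # followed by layer normalization (post-norm ordering is an assumption).
    attn = multi_head_self_attention(x, p["wq"], p["wk"], p["wv"], p["wo"], num_heads)
    x = layer_norm(x + attn)
    hidden = np.maximum(0.0, x @ p["w1"])  # ReLU as a stand-in activation
    return layer_norm(x + hidden @ p["w2"])

def classify(spec, params, num_heads=2):
    tokens = spec @ params["proj"]                            # one token per time frame
    tokens = np.concatenate([params["cls"], tokens], axis=0)  # prepend a class token
    tokens = tokens + params["pos"]                           # absolute position embeddings
    tokens = encoder_block(tokens, params["block"], num_heads)
    return softmax(tokens[0] @ params["head"])                # classify from the class token

def init_params(rng, n_frames, n_feats, d, d_ff, n_classes):
    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)

    block = {"wq": w(d, d), "wk": w(d, d), "wv": w(d, d), "wo": w(d, d),
             "w1": w(d, d_ff), "w2": w(d_ff, d)}
    return {"proj": w(n_feats, d), "cls": w(1, d), "pos": w(n_frames + 1, d),
            "block": block, "head": w(d, n_classes)}

# Illustrative input: random values standing in for a 98 x 40 spectrogram.
rng = np.random.default_rng(0)
params = init_params(rng, n_frames=98, n_feats=40, d=64, d_ff=256, n_classes=12)
spec = rng.normal(size=(98, 40))
probs = classify(spec, params)  # probability vector over 12 hypothetical classes
```

With untrained weights the output is only a valid probability vector, not a meaningful prediction. The point of the sketch is the data flow: every frame attends to every other frame directly, which is what distinguishes this design from models that place attention on top of a convolutional or recurrent encoder.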

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Residual Connection · Layer Normalization · Label Smoothing · Adam · Multi-Head Attention