Audiomer: A Convolutional Transformer For Keyword Spotting

Surya Kant Sahu; Sai Mitheran; Juhi Kamdar; Meet Gandhi

arXiv:2109.10252 · cs.LG · February 2, 2022

TL;DR

Audiomer is a novel convolutional transformer architecture that combines residual networks with Performer attention, achieving state-of-the-art keyword spotting performance on raw audio while being computationally efficient and capable of processing arbitrarily long clips.

Contribution

Introduces Audiomer, a new architecture combining 1D residual networks with Performer attention for efficient, high-performance keyword spotting on raw audio.

Findings

01. Outperforms previous methods in keyword spotting accuracy on raw audio waveforms

02. Computationally cheaper and more parameter-efficient than previous methods

03. Runs inference on arbitrarily long audio clips because it uses no positional encoding

Abstract

Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to the extremely large sequence length of audio waveforms or incur a performance penalty when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, in which we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in keyword spotting with raw audio waveforms, outperforming all previous methods while being computationally cheaper and more parameter-efficient. Additionally, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips owing to the absence of positional encoding. The code is available at https://github.com/The-Learning-Machines/Audiomer-PyTorch.
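To make the efficiency and arbitrary-length claims concrete, here is a minimal NumPy sketch of Performer-style linear attention using positive random features. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature count `m`, and the input scales are hypothetical, and the sketch uses i.i.d. Gaussian projections rather than the orthogonal ones in the full method. It also omits the numerical stabilization a production version would need.

```python
import numpy as np

def positive_random_features(x, projection):
    """Map rows of x (n, d) to positive features (n, m).

    With Gaussian rows w in `projection`, the expected dot product of
    two feature vectors equals exp(q . k), the softmax kernel.
    """
    m = projection.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)  # (n, 1)
    return np.exp(x @ projection.T - half_sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, projection):
    """Approximate softmax attention without forming the (n, n) matrix."""
    scale = q.shape[-1] ** -0.25          # splits the usual 1/sqrt(d) over q and k
    q_feat = positive_random_features(q * scale, projection)   # (n, m)
    k_feat = positive_random_features(k * scale, projection)   # (n, m)
    kv = k_feat.T @ v                     # (m, d_v), size independent of n
    normalizer = q_feat @ k_feat.sum(axis=0)                   # (n,)
    return (q_feat @ kv) / normalizer[:, None]

def softmax_attention(q, k, v):
    """Exact attention, quadratic in sequence length, for comparison."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Assumed toy sizes; the same projection is reused for both lengths.
rng = np.random.default_rng(0)
d, m = 16, 256
projection = rng.standard_normal((m, d))
for n in (64, 1024):
    q, k, v = (0.5 * rng.standard_normal((n, d)) for _ in range(3))
    approx = linear_attention(q, k, v, projection)
    exact = softmax_attention(q, k, v)
    print(n, approx.shape, float(np.abs(approx - exact).max()))
```

Nothing in `linear_attention` depends on the sequence length `n`, and no positional term is added, so the same projection handles both clip lengths in the loop. The intermediate `kv` summary has `m × d` entries regardless of `n`, whereas exact attention over a 16,000-step sequence would need a 16,000 × 16,000 score matrix (2.56 × 10^8 entries).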

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling

Methods: Fast Attention Via Positive Orthogonal Random Features · Performer