Audiomer: A Convolutional Transformer For Keyword Spotting
Surya Kant Sahu, Sai Mitheran, Juhi Kamdar, Meet Gandhi

TL;DR
Audiomer is a novel convolutional transformer architecture that combines residual networks with Performer attention, achieving state-of-the-art keyword spotting performance on raw audio while being computationally efficient and capable of processing arbitrarily long clips.
Contribution
Introduces Audiomer, a new architecture combining 1D residual networks with Performer attention for efficient, high-performance keyword spotting on raw audio.
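To make the "1D residual network on raw audio" half of that combination concrete, here is a minimal NumPy sketch of a generic 1D residual block. It is not the authors' implementation: the function names (`conv1d_same`, `residual_block`), the two-convolution layout, the ReLU, and the kernel size are all illustrative assumptions, since this summary does not specify the block's design.

```python
import numpy as np

def conv1d_same(x, w):
    """Cross-correlate x (c_in, t) with w (c_out, c_in, k), k odd.

    Zero padding keeps the time length t unchanged.
    """
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros((c_out, t))
    for i in range(k):
        # Each kernel tap contributes a channel-mixing matmul on a shifted view.
        out += w[:, :, i] @ xp[:, i:i + t]
    return out

def residual_block(x, w1, w2):
    """Hypothetical residual block: x + conv(relu(conv(x)))."""
    h = np.maximum(conv1d_same(x, w1), 0.0)
    return x + conv1d_same(h, w2)

# Usage: the same weights apply to waveforms of any length.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 4, 3)) * 0.1
w2 = rng.standard_normal((4, 4, 3)) * 0.1
for t in (100, 16000):
    x = rng.standard_normal((4, t))
    print(t, residual_block(x, w1, w2).shape)
```

Because the convolution weights are shared across time, nothing in the block depends on the clip length, and the skip connection passes the input through unchanged when the convolutional branch contributes nothing.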
Findings
Outperforms previous methods in keyword spotting accuracy
Computationally cheaper and more parameter-efficient than previous methods
Handles arbitrarily long audio clips without positional encoding
Abstract
Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to the extremely large sequence length of audio waveforms or incur a performance penalty when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in keyword spotting with raw audio waveforms, outperforming all previous methods while being computationally cheaper and parameter-efficient. Additionally, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips owing to the absence of positional encoding. The code is available at https://github.com/The-Learning-Machines/Audiomer-PyTorch.
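The abstract's two practical claims, cost that stays manageable on long waveforms and no dependence on positional encoding, both follow from the kind of attention Performer uses. The NumPy sketch below illustrates the general idea with positive random features, under stated assumptions: it uses plain Gaussian projections rather than the orthogonal ones in the Performer method, omits the numerical stabilisation a production implementation would need, and the names `positive_random_features` and `linear_attention` are hypothetical. It is not taken from the Audiomer code base.

```python
import numpy as np

def positive_random_features(x, proj):
    """Map x (n, d) to strictly positive features (n, m).

    phi(x) = exp(x @ proj.T - ||x||^2 / 2) / sqrt(m), so that
    phi(q) . phi(k) approximates exp(q . k) in expectation.
    """
    m = proj.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ proj.T - half_sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, proj):
    """Approximate softmax attention in time linear in sequence length."""
    d = q.shape[-1]
    scale = d ** -0.25  # splits the usual 1/sqrt(d) between q and k
    q_f = positive_random_features(q * scale, proj)  # (n_q, m)
    k_f = positive_random_features(k * scale, proj)  # (n_k, m)
    kv = k_f.T @ v           # (m, d_v): keys and values summarised once
    k_sum = k_f.sum(axis=0)  # (m,): normaliser, also summarised once
    return (q_f @ kv) / (q_f @ k_sum)[:, None]

# Usage: the same projection matrix serves short and long sequences.
rng = np.random.default_rng(0)
proj = rng.standard_normal((64, 8))  # 64 random features, head dim 8 (assumed sizes)
for n in (50, 5000):
    x = rng.standard_normal((n, 8))
    print(n, linear_attention(x, x, x, proj).shape)
```

The n-by-n attention matrix is never formed: keys and values are reduced to an m-by-d_v summary, so cost grows linearly with sequence length. The summary is a sum over positions, so the attention step itself carries no notion of order or maximum length, which is consistent with running inference on clips longer than those seen in training.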
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling
Methods: Fast Attention Via Positive Orthogonal Random Features · Performer
