Audio Transformers
Prateek Verma, Jonathan Berger

TL;DR
This paper introduces Transformer-based architectures for raw audio signal processing, achieving state-of-the-art results on the Free Sound 50K dataset without pre-training, and demonstrates their adaptability and improved performance over convolutional models.
Contribution
Proposes applying Transformer architectures directly to raw audio signals, incorporating pooling and wavelet-inspired multi-rate processing to enhance audio classification performance.
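As a rough illustration of the idea above, the sketch below splits a raw waveform into fixed-length frames, treats each frame as a token, and applies self-attention with average pooling along time between layers. It is a minimal NumPy sketch, not the authors' implementation: the 16 kHz sample rate, 400-sample frames, 64-dimensional embedding, two layers, pooling stride of 2, and all function names are assumptions chosen for illustration, and positional encodings, feed-forward blocks, and layer normalization are left out for brevity.

```python
import numpy as np

def frame_waveform(wave, frame_len):
    """Split a 1-D waveform into non-overlapping frames, one token per frame."""
    n_frames = len(wave) // frame_len
    return wave[: n_frames * frame_len].reshape(n_frames, frame_len)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def avg_pool_time(x, stride=2):
    """Average adjacent tokens along time, shortening the sequence by `stride`."""
    n = (x.shape[0] // stride) * stride
    return x[:n].reshape(-1, stride, x.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)             # synthetic stand-in: 1 s at an assumed 16 kHz
frames = frame_waveform(wave, frame_len=400)  # 40 tokens of 400 raw samples each

d_model = 64                                  # hypothetical embedding size
w_in = rng.standard_normal((400, d_model)) / np.sqrt(400)
tokens = frames @ w_in                        # linear projection, no convolution

for _ in range(2):                            # two illustrative layers
    w_q, w_k, w_v = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
    )
    tokens = tokens + self_attention(tokens, w_q, w_k, w_v)  # residual connection
    tokens = avg_pool_time(tokens)            # halve the time resolution

clip_embedding = tokens.mean(axis=0)          # one vector per clip, fed to a classifier
```

Pooling after each layer shortens the sequence (40 tokens, then 20, then 10 here), so later layers attend over coarser time scales at lower cost. The weights are random, so the output is only meaningful as a shape check.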
Findings
Transformer models outperform convolutional models on the Free Sound 50K dataset.
Techniques like pooling and wavelet-inspired processing improve Transformer performance.
Models learn adaptive non-linear band-width filter-banks for audio understanding.
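One common way to inspect a learned front end like the one in the last finding is to take the Fourier transform of each first-layer filter and sort the filters by peak frequency. The sketch below shows that analysis on synthetic sinusoidal filters standing in for learned weights. This summary does not say how the authors performed their analysis, so the function name, sample rate, frame length, and filters here are all assumptions.

```python
import numpy as np

def peak_frequencies(filters, sample_rate):
    """Return the peak frequency in Hz of each row of `filters` (one filter per row)."""
    spectra = np.abs(np.fft.rfft(filters, axis=1))
    freqs = np.fft.rfftfreq(filters.shape[1], d=1.0 / sample_rate)
    return freqs[spectra.argmax(axis=1)]

sample_rate = 16000   # assumed sample rate
frame_len = 400       # assumed filter length in samples
t = np.arange(frame_len) / sample_rate

# Synthetic stand-ins for learned first-layer weights: pure tones in arbitrary order.
center_hz = np.array([3000.0, 200.0, 1000.0])
filters = np.sin(2 * np.pi * center_hz[:, None] * t[None, :])

peaks = peak_frequencies(filters, sample_rate)
order = np.argsort(peaks)  # filter indices sorted from low to high peak frequency
```

Plotting the sorted peak frequencies against filter index shows how the filters are spaced along the frequency axis. A straight line would indicate linear spacing; a curve would indicate a non-linear spacing of the kind the finding describes.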
Abstract
Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On the standard Free Sound 50K dataset, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set,…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Residual Connection · Label Smoothing · Byte Pair Encoding · Dropout
