Study of positional encoding approaches for Audio Spectrogram Transformers
Leonardo Pepino, Pablo Riera, Luciana Ferrer

TL;DR
This paper studies positional encoding methods for Audio Spectrogram Transformers (ASTs) trained from scratch, without ImageNet pretraining, and reports significant improvements over the original AST on Audioset and ESC-50.
Contribution
The paper introduces new positional encoding variants for ASTs that enable training from scratch and outperform the original model without ImageNet pretraining.
Findings
Conditional positional encodings improve AST performance
AST variants outperform original AST on Audioset and ESC-50
Training from scratch is feasible with proposed encodings
Abstract
Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results on several datasets. However, in order for ASTs to outperform CNNs, pretraining with ImageNet is needed. In this paper, we study one component of the AST, the positional encoding, and propose several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining. Our best model, which incorporates conditional positional encodings, significantly improves performance on Audioset and ESC-50 compared to the original AST.
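To make the idea of a conditional positional encoding concrete, here is a minimal NumPy sketch of one common formulation: the position signal for each spectrogram patch is computed from its local neighbourhood with a zero-padded depthwise convolution and added back to the patch embeddings. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the exact design, and the function name, grid layout (frequency by time), kernel size, and random weights are all hypothetical.

```python
import numpy as np


def conditional_positional_encoding(tokens, grid_shape, kernel):
    """Add a position signal derived from each token's local neighbourhood.

    tokens:     (n_freq * n_time, dim) float patch embeddings, row-major order.
    grid_shape: (n_freq, n_time) layout of the patches (assumed layout).
    kernel:     (k, k, dim) depthwise filter with odd k; learned in a real
                model, supplied by the caller in this sketch.
    """
    n_freq, n_time = grid_shape
    dim = tokens.shape[1]
    k = kernel.shape[0]
    pad = k // 2

    # Restore the 2D patch grid so that neighbours are spatially adjacent.
    grid = tokens.reshape(n_freq, n_time, dim)

    # Zero padding lets tokens near the border "see" the edge, which is
    # what makes absolute position recoverable from a local operation.
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))

    # Depthwise convolution: each channel is filtered independently.
    pos = np.zeros_like(grid)
    for i in range(k):
        for j in range(k):
            pos += padded[i:i + n_freq, j:j + n_time, :] * kernel[i, j, :]

    # Add the encoding to the embeddings and flatten back to a sequence.
    return (grid + pos).reshape(n_freq * n_time, dim)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8 * 12, 16))       # 8 x 12 patches, 16 dims
kernel = rng.standard_normal((3, 3, 16)) * 0.1   # stand-in for learned weights
out = conditional_positional_encoding(tokens, (8, 12), kernel)
print(out.shape)  # (96, 16)
```

Because the encoding is computed from the input rather than looked up in a fixed-size table, the same kernel can be applied to a patch grid of any size, for example audio clips of different durations.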
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding · Dropout
