Study of positional encoding approaches for Audio Spectrogram Transformers

Leonardo Pepino; Pablo Riera; Luciana Ferrer

arXiv:2110.06999·cs.SD·October 9, 2023

TL;DR

This paper studies positional encoding variants for Audio Spectrogram Transformers (ASTs) trained from scratch, without ImageNet pretraining. The best variant, which uses conditional positional encodings, significantly improves on the original AST on Audioset and ESC-50.

Contribution

The paper proposes several positional encoding variants that improve the performance of ASTs trained from scratch, without ImageNet pretraining, relative to the original AST.

Findings

1. Conditional positional encodings improve AST performance
2. AST variants outperform original AST on Audioset and ESC-50
3. Training from scratch is feasible with proposed encodings

Abstract

Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results on several datasets. However, for ASTs to outperform CNNs, pretraining with ImageNet is needed. In this paper, we study one component of the AST, the positional encoding, and propose several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining. Our best model, which incorporates conditional positional encodings, significantly improves performance on Audioset and ESC-50 compared to the original AST.
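
To make the term concrete, here is a minimal NumPy sketch of the general idea behind conditional positional encodings as they are commonly described for vision transformers: the patch tokens are laid back out on their 2D grid (here, frequency by time), a zero-padded depthwise convolution is run over that grid, and the result is added to the tokens. This is an illustration under stated assumptions, not the authors' implementation; the function name `add_conditional_positions`, the kernel layout, and the grid ordering are all hypothetical, and the paper's variants may differ in where and how the encoding is applied.

```python
import numpy as np


def add_conditional_positions(tokens, grid_shape, kernel):
    """Add a convolution-derived positional term to patch tokens.

    tokens:     (n_freq * n_time, dim) patch embeddings, row-major over the grid
    grid_shape: (n_freq, n_time) layout of the spectrogram patches (assumed)
    kernel:     (k, k, dim) depthwise kernel with odd k (hypothetical layout)
    """
    n_freq, n_time = grid_shape
    tokens = np.asarray(tokens, dtype=float)
    num_patches, dim = tokens.shape
    if num_patches != n_freq * n_time:
        raise ValueError("tokens do not fill the patch grid")
    k = kernel.shape[0]
    if kernel.shape != (k, k, dim) or k % 2 == 0:
        raise ValueError("kernel must have shape (k, k, dim) with odd k")

    # Restore the 2D patch layout and zero-pad so the output keeps its size.
    grid = tokens.reshape(n_freq, n_time, dim)
    pad = k // 2
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))

    # Depthwise convolution: each channel is filtered independently.
    pos = np.zeros_like(grid)
    for i in range(k):
        for j in range(k):
            pos += padded[i:i + n_freq, j:j + n_time, :] * kernel[i, j, :]

    # The positional term depends on the input itself, hence "conditional".
    return (grid + pos).reshape(num_patches, dim)
```

Because the positional term is computed from the tokens rather than looked up from a fixed table, the same kernel applies to grids of any size, which is one reason this family of encodings is attractive for variable-length audio.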

Code & Models

Repositories

habla-liaa/ast-pe · tf · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding · Dropout