AST: Audio Spectrogram Transformer

Yuan Gong; Yu-An Chung; James Glass

arXiv:2104.01778·cs.SD·July 12, 2021·31 cites

TL;DR

This paper introduces the Audio Spectrogram Transformer (AST), a convolution-free, attention-only model that achieves state-of-the-art results on several audio classification benchmarks.

Contribution

The paper presents the first convolution-free, purely attention-based model for audio classification. It shows that a CNN is not required for good performance: AST sets new state-of-the-art results on AudioSet, ESC-50, and Speech Commands V2.

Findings

01 Achieves 0.485 mAP on AudioSet

02 Attains 95.6% accuracy on ESC-50

03 Reaches 98.1% accuracy on Speech Commands V2
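
The AudioSet result is reported as mean average precision (mAP), a multi-label metric, while the other two are plain accuracy. As a rough illustration of what that number measures, the sketch below computes non-interpolated average precision per class and averages it over classes. The function names and the toy data are hypothetical, and the exact evaluation protocol behind the reported 0.485 is not stated here, so treat this as a minimal sketch rather than the benchmark's scoring code.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Non-interpolated AP for one class.

    y_true: 0/1 labels per clip; y_score: predicted scores per clip.
    Returns NaN when the class has no positive clips.
    """
    order = np.argsort(-y_score, kind="stable")  # rank clips by descending score
    hits = y_true[order].astype(float)
    if hits.sum() == 0:
        return float("nan")
    # Precision after each ranked clip, counted only at positions of positives.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_average_precision(Y_true, Y_score):
    """Mean of per-class AP over the columns of (clips, classes) arrays."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))  # skip classes without positives


# Toy example: 4 clips, 2 classes (made-up numbers).
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.1, 0.6]])
print(round(mean_average_precision(Y_true, Y_score), 4))
```

Because mAP averages over classes, rare classes weigh as much as common ones, which is one reason it is a stricter summary than accuracy for a multi-label dataset.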

Abstract

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy…
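
To make the idea of a convolution-free, attention-only classifier concrete, here is a minimal NumPy sketch: the spectrogram is cut into patches, each patch is linearly projected to a token, position encodings are added, and one self-attention layer with a residual connection mixes the tokens before a classification head. Everything specific in it is an assumption made for illustration (16x16 non-overlapping patches, a single head and a single layer, mean pooling, sigmoid outputs, random weights, and all function and parameter names); it is not the configuration or code from the paper.

```python
import numpy as np


def split_into_patches(spec, patch=16):
    """Cut a (freq, time) spectrogram into flattened patch x patch tiles."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch  # drop the ragged border
    tiles = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v


def classify(spec, params, patch=16):
    """Spectrogram -> per-class scores in (0, 1), with no convolution."""
    tokens = split_into_patches(spec, patch) @ params["embed"]  # linear projection
    tokens = tokens + params["pos"][: len(tokens)]              # position encodings
    tokens = tokens + self_attention(
        tokens, params["wq"], params["wk"], params["wv"])       # residual connection
    pooled = tokens.mean(axis=0)                                # assumption: mean pooling
    return 1.0 / (1.0 + np.exp(-(pooled @ params["head"])))     # multi-label sigmoid


def init_params(rng, patch=16, dim=32, max_tokens=64, n_classes=5):
    """Random, untrained weights; sizes are arbitrary illustration values."""
    def w(*shape):
        return rng.normal(scale=0.02, size=shape)
    return {"embed": w(patch * patch, dim), "pos": w(max_tokens, dim),
            "wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
            "head": w(dim, n_classes)}


rng = np.random.default_rng(0)
spec = rng.normal(size=(128, 100))  # stand-in for a log-mel spectrogram
scores = classify(spec, init_params(rng))
print(scores.shape)
```

The point of the sketch is the data flow: every token can attend to every other token in a single step, which is how an attention-only model captures the long-range context that the abstract says hybrids add attention to a CNN to obtain.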

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Layer Normalization · Label Smoothing