AST: Audio Spectrogram Transformer
Yuan Gong, Yu-An Chung, James Glass

TL;DR
This paper introduces the Audio Spectrogram Transformer (AST), a convolution-free, purely attention-based model that achieves state-of-the-art results on several audio classification benchmarks.
Contribution
The paper presents what the authors describe as the first convolution-free, purely attention-based model for audio classification, showing that a CNN is not required to reach state-of-the-art results on AudioSet, ESC-50, and Speech Commands V2.
Findings
Achieves 0.485 mAP on AudioSet
Attains 95.6% accuracy on ESC-50
Reaches 98.1% accuracy on Speech Commands V2
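The AudioSet figure above is a mean average precision (mAP), a metric for multi-label classification. The sketch below shows one common, non-interpolated way to compute it with NumPy. The function names and toy data are illustrative assumptions, and the paper's own evaluation script may differ in details such as tie handling.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Average precision for one class: the mean of precision@k taken at
    the rank k of every positive example (non-interpolated definition)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")   # highest score first
    hits = y_true[order]
    if not hits.any():
        return float("nan")                       # undefined without positives
    ranks = np.arange(1, len(hits) + 1)
    precision_at_k = np.cumsum(hits) / ranks
    return float(precision_at_k[hits].mean())

def mean_average_precision(Y_true, Y_score):
    """mAP: per-class AP averaged over classes (columns), skipping classes
    that have no positive example."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))

# Toy data (hypothetical): 4 clips, 2 classes.
Y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
Y_score = np.array([[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.1, 0.3]])
print(mean_average_precision(Y_true, Y_score))
```

In the toy data, class 0 has positives at ranks 1 and 3, so its AP is (1/1 + 2/3) / 2, while class 1 is ranked perfectly and has an AP of 1.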
Abstract
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy…
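To make "purely attention-based" concrete, here is a minimal sketch of single-head scaled dot-product self-attention applied to a sequence of spectrogram patch embeddings. It is not the authors' implementation: the toy sizes, random weights, and the function name `self_attention` are assumptions for illustration only.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (num_patches, dim). Every output row is a weighted sum over
    all patches, which is how attention captures long-range global context.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

# Toy sizes (hypothetical, far smaller than a real model).
rng = np.random.default_rng(0)
num_patches, dim = 6, 8
x = rng.normal(size=(num_patches, dim))          # stand-in patch embeddings
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1 and spans every patch, so a single layer can relate any two time-frequency regions of the spectrogram, whereas a convolution only sees a local neighbourhood.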
Code & Models
- 🤗 MIT/ast-finetuned-audioset-10-10-0.4593 · model · 796k dl · ♡ 351
- 🤗 MIT/ast-finetuned-speech-commands-v2 · model · 1.7k dl · ♡ 18
- 🤗 MIT/ast-finetuned-audioset-10-10-0.450 · model · 337 dl · ♡ 4
- 🤗 MIT/ast-finetuned-audioset-10-10-0.448 · model · 369 dl · ♡ 1
- 🤗 MIT/ast-finetuned-audioset-10-10-0.448-v2 · model · 51 dl
- 🤗 MIT/ast-finetuned-audioset-12-12-0.447 · model · 734 dl
- 🤗 MIT/ast-finetuned-audioset-14-14-0.443 · model · 15k dl · ♡ 6
- 🤗 MIT/ast-finetuned-audioset-16-16-0.442 · model · 1.9k dl · ♡ 1
- 🤗 bookbot/distil-ast-audioset · model · 2.1k dl · ♡ 24
- 🤗 saurabhati/DASS_small_AudioSet_47.2 · model · 2 dl · ♡ 1
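The `10-10`, `12-12`, `14-14`, and `16-16` fragments in the MIT checkpoint names appear to encode the frequency and time strides used when cutting the spectrogram into patches, and the trailing number appears to be the checkpoint's AudioSet mAP. Under that reading, the sketch below estimates how many patches each variant feeds to the Transformer. The 128 x 1024 input size and the 16 x 16 patch size are assumptions not stated on this page, and `num_patches` is a hypothetical helper.

```python
def num_patches(freq_bins, time_frames, patch=16, stride=10):
    """Count patches from sliding a patch x patch window over a spectrogram
    with the same stride along both axes and no padding."""
    f = (freq_bins - patch) // stride + 1
    t = (time_frames - patch) // stride + 1
    return f, t, f * t

# Assumed input: 128 mel bins x 1024 frames (not given in the text above).
for stride in (10, 12, 14, 16):
    f, t, n = num_patches(128, 1024, patch=16, stride=stride)
    print(f"stride {stride}: {f} x {t} = {n} patches")
```

A smaller stride means more overlap between neighbouring patches and a longer input sequence, which raises the cost of self-attention. That matches the ordering of the scores embedded in the checkpoint names, where the `10-10` variants list the highest values.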
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Layer Normalization · Label Smoothing
