Multiscale Audio Spectrogram Transformer for Efficient Audio Classification
Wentao Zhu, Mohamed Omar

TL;DR
This paper introduces MAST, a hierarchical multiscale Transformer for audio classification that improves accuracy and efficiency by leveraging multiscale pooling and hierarchical representations, outperforming previous models on multiple datasets.
Contribution
Develops MAST, a multiscale hierarchical Transformer architecture that enhances audio classification accuracy and efficiency without external data, outperforming AST on several benchmarks.
Findings
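MAST outperforms AST by 22.2%, 4.4%, and 4.7% in top-1 accuracy on Kinetics-Sounds, Epic-Kitchens-100, and VGGSound, respectively, without external training data.
On the downloaded AudioSet dataset, which is missing over 20% of its audio clips, MAST achieves slightly better accuracy than AST.
MAST is 5x more efficient than AST in terms of multiply-accumulate operations (MACs) and has 42% fewer parameters.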
Abstract
Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, and progressively reduces the number of tokens and increases the feature dimensions. MAST significantly outperforms AST (Gong et al., 2021) by 22.2%, 4.4% and 4.7% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20% of its audio clips missing, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient…
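The pooling schedule described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the input grid size, the per-stage strides, and the channel multipliers below are hypothetical values chosen only to show how pooling along time (and, in later stages, frequency) shrinks the token count while the feature dimension grows. The helper names `avg_pool_2d` and `stage_shapes` are likewise illustrative.

```python
from typing import List, Tuple


def avg_pool_2d(grid: List[List[float]], stride_t: int, stride_f: int) -> List[List[float]]:
    """Average-pool a [time][frequency] grid with non-overlapping windows.

    Setting stride_f=1 gives one-dimensional pooling along time only.
    """
    t_out = len(grid) // stride_t
    f_out = len(grid[0]) // stride_f
    pooled = []
    for i in range(t_out):
        row = []
        for j in range(f_out):
            window = [
                grid[i * stride_t + a][j * stride_f + b]
                for a in range(stride_t)
                for b in range(stride_f)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def stage_shapes(
    time_tokens: int,
    freq_tokens: int,
    dim: int,
    stages: List[Tuple[int, int, int]],
) -> List[Tuple[int, int, int]]:
    """Track (time tokens, frequency tokens, feature dim) across stages.

    Each stage is (time stride, frequency stride, channel multiplier).
    """
    shapes = [(time_tokens, freq_tokens, dim)]
    for stride_t, stride_f, dim_mult in stages:
        time_tokens //= stride_t
        freq_tokens //= stride_f
        dim *= dim_mult
        shapes.append((time_tokens, freq_tokens, dim))
    return shapes


# Hypothetical schedule (assumption, not taken from the paper):
# stage 1 pools along time only, stages 2 and 3 pool along time and frequency.
STAGES = [(2, 1, 2), (2, 2, 2), (2, 2, 2)]
shapes = stage_shapes(time_tokens=64, freq_tokens=8, dim=96, stages=STAGES)

for t, f, d in shapes:
    print(f"tokens={t * f:4d} (time={t}, freq={f}), dim={d}")
```

Self-attention cost grows quadratically with the number of tokens, so cutting the token count at each stage is one plausible source of the MAC savings the abstract reports; the exact configuration behind those numbers is not given here.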
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Softmax · Label Smoothing · Byte Pair Encoding · Residual Connection
