Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Wentao Zhu; Mohamed Omar

arXiv:2303.10757 · cs.SD · March 21, 2023 · 1 citation

Open Access

TL;DR

This paper introduces MAST, a hierarchical multiscale Transformer for audio classification that improves accuracy and efficiency by leveraging multiscale pooling and hierarchical representations, outperforming previous models on multiple datasets.

Contribution

Develops MAST, a multiscale hierarchical Transformer architecture that enhances audio classification accuracy and efficiency without external data, outperforming AST on several benchmarks.

Findings

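1. MAST outperforms AST by 22.2%, 4.4%, and 4.7% in top-1 accuracy on Kinetics-Sounds, Epic-Kitchens-100, and VGGSound, respectively, without external training data.

2. On the downloaded AudioSet, which has over 20% of its audio clips missing, MAST achieves slightly better accuracy than AST.

3. MAST is 5x more efficient in terms of multiply-accumulate operations (MACs) and has 42% fewer parameters than AST.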
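
To make the MACs comparison more concrete, the sketch below counts multiply-accumulate operations for one standard self-attention layer as a function of token count and feature dimension. This is the textbook accounting for dense attention, not the paper's own measurement; the function name `attention_macs` and the example sizes are assumptions chosen for illustration, and the count ignores the softmax, the feed-forward block, and any pooling inside attention.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Rough MAC count for one dense self-attention layer (illustrative only)."""
    # Four linear maps of size dim x dim applied to every token: Q, K, V, output.
    projections = 4 * num_tokens * dim * dim
    # Q @ K^T: every token attends to every token, with a dot product of length dim.
    scores = num_tokens * num_tokens * dim
    # Attention weights @ V: same shape of work as the score computation.
    weighted_sum = num_tokens * num_tokens * dim
    return projections + scores + weighted_sum


# Hypothetical sizes: the same feature dimension with 4x fewer tokens.
full = attention_macs(num_tokens=512, dim=96)
pooled = attention_macs(num_tokens=128, dim=96)
print(f"full: {full:,} MACs, pooled: {pooled:,} MACs, ratio: {full / pooled:.1f}x")
```

The score and weighted-sum terms grow quadratically with the number of tokens, so a model that shrinks its token grid in later stages spends far fewer MACs there than one that keeps the full sequence length throughout.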

Abstract

Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, and progressively reduces the number of tokens and increases the feature dimensions. MAST significantly outperforms AST (Gong et al., 2021) by 22.2%, 4.4% and 4.7% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20% of its audio clips missing, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient…
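
The stage-wise behaviour described in the abstract (pooling along time, or along time and frequency, while the feature dimension grows) can be sketched as a simple shape schedule. Everything numeric here is an assumption for illustration: the initial 64 x 8 token grid, the base dimension of 96, the strides, and the `Stage` and `multiscale_schedule` names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    time_stride: int     # pooling stride along the time axis
    freq_stride: int     # pooling stride along frequency (1 means time-only, 1D pooling)
    dim_multiplier: int  # factor by which the feature dimension grows in this stage


def multiscale_schedule(time_tokens, freq_tokens, dim, stages):
    """Return (time_tokens, freq_tokens, total_tokens, dim) after each stage."""
    rows = []
    for stage in stages:
        time_tokens //= stage.time_stride
        freq_tokens //= stage.freq_stride
        dim *= stage.dim_multiplier
        rows.append((time_tokens, freq_tokens, time_tokens * freq_tokens, dim))
    return rows


# Hypothetical configuration: no pooling, then 1D (time) pooling, then 2D pooling twice.
stages = [Stage(1, 1, 1), Stage(2, 1, 2), Stage(2, 2, 2), Stage(2, 2, 2)]
schedule = multiscale_schedule(time_tokens=64, freq_tokens=8, dim=96, stages=stages)

for index, (t, f, tokens, dim) in enumerate(schedule, start=1):
    print(f"stage {index}: {t} x {f} = {tokens} tokens, dim {dim}")
```

Under these assumed settings the token count falls from 512 to 16 across four stages while the feature dimension rises from 96 to 768, which is the general pattern the abstract describes.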

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Softmax · Label Smoothing · Byte Pair Encoding · Residual Connection