Masked Autoencoders that Listen
Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

TL;DR
This paper introduces Audio-MAE, a self-supervised learning method for audio spectrograms based on masked autoencoders, achieving state-of-the-art results on multiple audio classification tasks.
Contribution
It extends image-based Masked Autoencoders to audio spectrograms, incorporating local window attention and demonstrating superior performance without external supervision.
Findings
Sets new state-of-the-art on six audio and speech tasks
Outperforms models with external supervised pre-training
Effective use of local window attention in decoder
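The local window attention named in the last finding can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions, not the authors' implementation: windows here are fixed and non-overlapping, and the names `local_window_mask` and `masked_softmax_attention`, the grid size, and the window size are all hypothetical choices made for illustration.

```python
import numpy as np

def local_window_mask(grid_t, grid_f, window=4):
    """Return a boolean (N, N) matrix, N = grid_t * grid_f.

    Entry [i, j] is True when patch tokens i and j fall in the same
    non-overlapping (window x window) block of the time-frequency grid.
    The window size is an assumed value, not taken from the paper.
    """
    ti, fi = np.meshgrid(np.arange(grid_t), np.arange(grid_f), indexing="ij")
    n_win_f = (grid_f + window - 1) // window          # windows per frequency row
    win_id = (ti // window) * n_win_f + (fi // window)  # one id per window
    win_id = win_id.reshape(-1)
    return win_id[:, None] == win_id[None, :]

def masked_softmax_attention(q, k, v, allow):
    """Single-head scaled dot-product attention restricted by `allow`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allow, scores, -np.inf)           # block out-of-window pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: an 8 x 8 token grid with 4 x 4 windows.
rng = np.random.default_rng(0)
allow = local_window_mask(8, 8, window=4)
tokens = rng.standard_normal((64, 32))
out = masked_softmax_attention(tokens, tokens, tokens, allow)
```

Each token attends only to the 16 tokens in its own window, which matches the intuition that spectrogram content is most strongly correlated within nearby time and frequency bands.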
Abstract
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code…
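The masking step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released code: the 1024 x 128 input and 16 x 16 patches are guesses based on the checkpoint name `vit_base_patch16_1024_128`, the 0.8 masking ratio is an assumed stand-in for "a high masking ratio", and the helper names `random_mask_patches` and `restore_order` are hypothetical.

```python
import numpy as np

def random_mask_patches(spec, patch=16, mask_ratio=0.8, seed=0):
    """Patchify a (time, freq) spectrogram and keep a random subset of patches.

    Returns the visible patches, their indices in the full sequence, and a
    boolean mask (True = masked). Only the visible patches would be fed to
    the encoder. mask_ratio=0.8 is an assumed value.
    """
    rng = np.random.default_rng(seed)
    t, f = spec.shape
    assert t % patch == 0 and f % patch == 0, "shape must be divisible by patch"
    gt, gf = t // patch, f // patch
    # (gt, patch, gf, patch) -> (gt * gf, patch * patch), row-major over the grid
    patches = (spec.reshape(gt, patch, gf, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(gt * gf, patch * patch))
    n = patches.shape[0]
    n_keep = int(round(n * (1 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

def restore_order(encoded, keep_idx, n_patches, mask_token):
    """Rebuild the full-length sequence for the decoder.

    Encoded tokens go back to their original positions; every other
    position is filled with the (here fixed, normally learned) mask token.
    """
    full = np.tile(mask_token, (n_patches, 1))
    full[keep_idx] = encoded
    return full

# Toy usage with an identity "encoder" standing in for the Transformer.
spec = np.random.default_rng(1).standard_normal((1024, 128))
visible, keep_idx, mask = random_mask_patches(spec)
decoder_input = restore_order(visible, keep_idx, mask.size, np.zeros(visible.shape[1]))
print(visible.shape, decoder_input.shape)  # (102, 256) (512, 256)
```

Because the encoder sees only the visible tokens (102 of 512 in this toy setup), most of the sequence is skipped during encoding; the reconstruction loss would then be computed against the original patches.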
Code & Models
- 🤗 gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k (model · 255 dl · ♡ 2)
- 🤗 gaunernst/vit_base_patch16_1024_128.audiomae_as2m (model · 364 dl · ♡ 1)
- 🤗 saurabhati/DASS_small_AudioSet_47.2 (model · 2 dl · ♡ 1)
- 🤗 saurabhati/DASS_medium_AudioSet_47.6 (model · 2 dl)
- 🤗 saurabhati/DASS_small_AudioSet_48.6 (model · 10 dl)
- 🤗 saurabhati/DASS_medium_AudioSet_48.9 (model)
- 🤗 saurabhati/DASS_small_AudioSet_50.1 (model · 45 dl)
- 🤗 saurabhati/DASS_medium_AudioSet_50.2 (model · 53 dl · ♡ 2)
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer
