Masked Autoencoders that Listen

Po-Yao Huang; Hu Xu; Juncheng Li; Alexei Baevski; Michael Auli; Wojciech Galuba; Florian Metze; Christoph Feichtenhofer

arXiv:2207.06405 · cs.SD · January 13, 2023 · 109 cites

TL;DR

This paper introduces Audio-MAE, a self-supervised learning method for audio spectrograms based on masked autoencoders, which achieves state-of-the-art results on six audio and speech classification tasks.

Contribution

The paper extends image-based Masked Autoencoders (MAE) to audio spectrograms, adds local window attention in the decoder, and outperforms recent models that rely on external supervised pre-training without using such supervision itself.

Findings

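01. Sets a new state of the art on six audio and speech classification tasks.

02. Outperforms recent models that use external supervised pre-training.

03. Local window attention in the decoder is beneficial, because audio spectrograms are highly correlated in local time and frequency bands.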
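To make the idea of local window attention concrete, here is a minimal sketch of an attention mask over a grid of spectrogram patches. It assumes the simplest possible variant, non-overlapping windows over the time and frequency axes; the paper may use a different windowing scheme, and the function name, grid size, and window size below are hypothetical choices for illustration only.

```python
def window_attention_mask(n_time, n_freq, win_t, win_f):
    """Build a boolean mask over flattened patch indices.

    mask[i][j] is True when patches i and j fall in the same
    (win_t x win_f) window of the time-frequency patch grid.
    Patches are flattened row-major: index = t * n_freq + f.
    """
    def window_id(idx):
        t, f = divmod(idx, n_freq)
        return (t // win_t, f // win_f)

    n = n_time * n_freq
    ids = [window_id(i) for i in range(n)]
    return [[ids[i] == ids[j] for j in range(n)] for i in range(n)]


# Hypothetical grid: 8 time patches x 4 frequency patches, 4 x 4 windows.
mask = window_attention_mask(n_time=8, n_freq=4, win_t=4, win_f=4)
allowed_per_token = sum(mask[0])  # number of patches token 0 may attend to
```

With such a mask, each decoder token attends only to patches that are nearby in time and frequency, instead of to the whole spectrogram.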

Abstract

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code…
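The masking and re-ordering steps described in the abstract can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the 0.8 masking ratio, the 64 x 8 patch grid, the function names, and the tuple stand-ins for encoder outputs are all hypothetical.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into (keep_ids, mask_ids), both sorted.

    Only the patches in keep_ids would be fed through the encoder.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)
    return sorted(perm[:num_keep]), sorted(perm[num_keep:])


def restore_order(encoded, keep_ids, num_patches, mask_token):
    """Put encoded tokens back at their original positions.

    Every masked position is filled with the shared mask_token,
    giving the full-length sequence the decoder would consume.
    """
    full = [mask_token] * num_patches
    for token, idx in zip(encoded, keep_ids):
        full[idx] = token
    return full


rng = random.Random(0)
num_patches = 64 * 8  # assumed grid: 64 time x 8 frequency patches
keep_ids, mask_ids = random_masking(num_patches, 0.8, rng)  # assumed ratio

# Stand-in for encoder outputs: one token per visible patch.
encoded = [("enc", i) for i in keep_ids]
decoder_input = restore_order(encoded, keep_ids, num_patches, "[MASK]")
```

Because the encoder sees only the visible patches, its sequence length shrinks in proportion to the masking ratio; the decoder input is restored to the full patch count before reconstruction.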

Code & Models

Videos

Masked Autoencoders that Listen · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer