Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training

Dading Chong; Helin Wang; Peilin Zhou; Qingcheng Zeng

arXiv:2204.12768 · cs.SD · April 28, 2022 · 1 citation

TL;DR

This paper introduces MaskSpec, a self-supervised learning method that masks and reconstructs spectrogram patches to learn effective audio representations, outperforming previous models on multiple audio classification benchmarks.

Contribution

The paper proposes a novel masked spectrogram prediction approach for self-supervised audio pre-training, improving performance without extra supervision or model weights.

Findings

01. Achieves state-of-the-art results on AudioSet with 0.471 mAP
02. Outperforms previous pre-trained models on multiple datasets
03. Demonstrates effectiveness of masked spectrogram reconstruction for audio tasks
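
The mAP figure above is a mean average precision. As a rough illustration of how such a score is commonly computed for multi-label audio tagging, the sketch below averages per-class average precision over all classes. The function names, the toy labels, and the tie handling are assumptions made for this example; it is not the evaluation code used in the paper.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Average precision for one class: mean of precision@k over the positive items."""
    order = np.argsort(-y_score, kind="stable")   # rank clips by descending score
    hits = y_true[order].astype(float)            # 1.0 where the ranked clip is a positive
    if hits.sum() == 0:
        return float("nan")                       # class has no positives: undefined
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(Y_true, Y_score):
    """Mean of per-class AP, skipping classes without any positive example."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))

# Toy data (hypothetical): 4 clips, 2 classes.
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]])
map_score = mean_average_precision(Y_true, Y_score)
```

In this toy case, class 0 has an AP of 5/6 and class 1 an AP of 1, so `map_score` is 11/12. Library implementations may treat tied scores differently from this sketch.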

Abstract

Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based models for audio tasks are fine-tuned from models pre-trained in other domains (e.g. images), which differ notably from the audio domain. Other methods explore self-supervised learning directly in the audio domain but currently do not perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), to learn powerful audio representations from unlabeled audio data (AudioSet is used in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Without using extra…
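
The masking step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the patch size (16), the mask ratio (0.75), the spectrogram shape, and all function names are assumptions chosen for the example, and the decoder output is replaced by a zero placeholder.

```python
import numpy as np

def patchify(spec, patch):
    """Split a (freq, time) spectrogram into flattened, non-overlapping patch x patch tiles."""
    f, t = spec.shape
    assert f % patch == 0 and t % patch == 0, "shape must be divisible by patch size"
    tiles = spec.reshape(f // patch, patch, t // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True for the randomly chosen patches to hide."""
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared reconstruction error computed over the masked patches only."""
    return float(((pred[mask] - target[mask]) ** 2).mean())

rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 64))       # stand-in for a log-mel spectrogram (assumed shape)
patches = patchify(spec, patch=16)          # assumed patch size
mask = random_mask(len(patches), 0.75, rng) # assumed mask ratio
visible = patches[~mask]                    # only these patches would be fed to the encoder
pred = np.zeros_like(patches)               # placeholder for the decoder's reconstruction
loss = masked_mse(pred, patches, mask)
```

With these assumed settings the spectrogram yields 32 patches, of which 24 are hidden and 8 remain visible. In a real model, `pred` would come from the encoder-decoder, and the loss would drive it toward the original content of the masked regions.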

Code & Models

Repositories

wanghelin1997/maskspec · PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Structural Health Monitoring Techniques