Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, and Nobutaka Ono

arXiv:2603.23810 · eess.AS · March 26, 2026

Open Access

TL;DR

This paper introduces dispersion-weighted masking (DWM), a computationally efficient strategy for audio self-supervised learning that leverages spectral sparsity, improving performance over existing masking methods.

Contribution

We propose DWM, a lightweight masking strategy that reduces computational overhead and enhances audio representation learning compared to existing informed masking techniques.

Findings

01. DWM outperforms inverse block masking in audio event understanding.

02. DWM reduces the computational complexity of masking strategies.

03. DWM provides consistent performance improvements across experiments.

Abstract

Since the introduction of Masked Autoencoders, various improvements to masking techniques have been explored. In this paper, we rethink masking strategies for audio representation learning using masked prediction-based self-supervised learning (SSL) on general audio spectrograms. While recent informed masking techniques have attracted attention, we observe that they incur substantial computational overhead. Motivated by this observation, we propose dispersion-weighted masking (DWM), a lightweight masking strategy that leverages the spectral sparsity inherent in the frequency structure of audio content. Our experiments show that inverse block masking, commonly used in recent SSL frameworks, improves audio event understanding performance while introducing a trade-off in generalization. The proposed DWM alleviates these limitations and computational complexity, leading to consistent…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing