Masked Autoencoders as Universal Speech Enhancer

Rajalaxmi Rajagopalan; Ritwik Giri; Zhiqiang Tang; Kyu Han

arXiv:2602.02413 · cs.SD · February 3, 2026

TL;DR

This paper introduces a self-supervised masked autoencoder approach for universal speech enhancement that effectively handles multiple distortions and improves downstream speech tasks, outperforming existing methods.

Contribution

The work presents a novel self-supervised masked autoencoder model that is distortion-agnostic and capable of enhancing speech across various types of noise and reverberation.

Findings

01. Outperforms baseline methods in speech enhancement tasks.

02. Achieves state-of-the-art results on in-domain and out-of-domain datasets.

03. Effective for both denoising and dereverberation applications.

Abstract

Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific…
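
The pre-training setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian-noise stand-in for the augmentation stack, the patch size, the mask ratio, and the mean-squared-error loss are all assumptions made for illustration. The sketch takes an already noisy spectrogram, adds a further distortion, masks random patches, and keeps the original noisy spectrogram as the reconstruction target, so a model trained on such pairs would have to both undo the added distortion and fill in the masked regions.

```python
import random


def add_noise(spec, noise_std, rng):
    """Hypothetical stand-in for the augmentation stack: additive Gaussian noise."""
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in spec]


def mask_patches(spec, patch, mask_ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    Returns the masked spectrogram and a same-shaped 0/1 mask (1 = masked).
    """
    n_freq, n_time = len(spec), len(spec[0])
    blocks = [(f, t)
              for f in range(0, n_freq, patch)
              for t in range(0, n_time, patch)]
    chosen = rng.sample(blocks, int(round(mask_ratio * len(blocks))))
    masked = [row[:] for row in spec]
    mask = [[0] * n_time for _ in range(n_freq)]
    for f0, t0 in chosen:
        for f in range(f0, min(f0 + patch, n_freq)):
            for t in range(t0, min(t0 + patch, n_time)):
                masked[f][t] = 0.0
                mask[f][t] = 1
    return masked, mask


def make_pretraining_pair(noisy_spec, patch=2, mask_ratio=0.5,
                          noise_std=0.1, seed=0):
    """Build (model_input, target, mask) from one noisy spectrogram.

    Assumption: the target is the input before the extra distortion was added.
    """
    rng = random.Random(seed)
    distorted = add_noise(noisy_spec, noise_std, rng)
    model_input, mask = mask_patches(distorted, patch, mask_ratio, rng)
    target = [row[:] for row in noisy_spec]
    return model_input, target, mask


def mse(prediction, target):
    """Mean squared error over all time-frequency bins (assumed loss)."""
    n = len(target) * len(target[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(prediction, target)
               for p, t in zip(p_row, t_row)) / n


# Toy 4 x 8 "spectrogram" (frequency bins x time frames).
spec = [[float(f + t) for t in range(8)] for f in range(4)]
model_input, target, mask = make_pretraining_pair(spec)
```

Here `model_input` is what an encoder would see and `target` is what a decoder would be asked to reproduce. In a real system the augmentation would likely combine several distortion types, and the loss could be weighted differently on masked and unmasked regions; the truncated abstract does not specify either choice.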

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis