Masked Autoencoders as Universal Speech Enhancer
Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang, Kyu Han

TL;DR
This paper introduces a self-supervised masked autoencoder approach for universal speech enhancement that effectively handles multiple distortions and improves downstream speech tasks, outperforming existing methods.
Contribution
The work presents a novel self-supervised masked autoencoder model that is distortion-agnostic and capable of enhancing speech across various types of noise and reverberation.
Findings
Outperforms baseline methods in speech enhancement tasks.
Achieves state-of-the-art results on in-domain and out-of-domain datasets.
Effective for both denoising and dereverberation applications.
Abstract
Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific…
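The pre-training setup described in the abstract (distort the already-noisy input further, mask regions of the spectrogram, then reconstruct the undistorted input) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the Gaussian-noise augmentation, the patch size, the mask ratio, and the mean-squared-error loss over all bins are assumptions chosen for illustration, and the model itself is left out.

```python
import numpy as np

def augment(spec, rng, noise_std=0.1):
    # Hypothetical stand-in for the augmentation stack: additive Gaussian noise.
    # The paper's stack applies further distortions; which ones is not assumed here.
    return spec + noise_std * rng.standard_normal(spec.shape)

def random_patch_mask(shape, rng, patch=(4, 4), mask_ratio=0.5):
    # Boolean mask over a (freq, time) spectrogram; True marks hidden bins.
    # Patch size and mask ratio are illustrative assumptions.
    n_f, n_t = shape
    p_f, p_t = patch
    assert n_f % p_f == 0 and n_t % p_t == 0, "shape must be divisible by patch"
    g_f, g_t = n_f // p_f, n_t // p_t
    n_patches = g_f * g_t
    n_masked = int(round(mask_ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(g_f, g_t)
    # Expand each patch-level decision to cover its block of bins.
    return np.repeat(np.repeat(grid, p_f, axis=0), p_t, axis=1)

def make_pretraining_pair(noisy_spec, rng):
    # Target: the available noisy spectrogram (no clean speech is required).
    # Input: the same spectrogram with added distortion and masked patches zeroed.
    target = noisy_spec
    distorted = augment(noisy_spec, rng)
    mask = random_patch_mask(noisy_spec.shape, rng)
    model_input = np.where(mask, 0.0, distorted)
    return model_input, target, mask

def reconstruction_loss(pred, target):
    # Assumed objective: mean squared error over every time-frequency bin.
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((16, 32)))  # toy magnitude spectrogram
model_input, target, mask = make_pretraining_pair(spec, rng)
```

A model trained on such pairs has to undo the added distortion and fill in the hidden patches at the same time, which is the combination the abstract describes for pre-training.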
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
