Self-supervised Learning for Speech Enhancement
Yu-Che Wang, Shrikant Venkataramani, Paris Smaragdis

TL;DR
This paper introduces a self-supervised learning approach for speech enhancement that leverages autoencoding and shared latent representations, eliminating the need for labeled noisy-clean speech pairs.
Contribution
It presents a novel self-supervised training schema that enables speech enhancement without requiring labeled training data or human intervention.
Findings
Maps noisy speech to its clean version using self-supervised autoencoding with a shared latent representation.
Reduces dependency on labeled datasets for speech enhancement.
Demonstrates an autonomous training process for speech enhancement networks.
Abstract
Supervised learning for single-channel speech enhancement requires carefully labeled training examples where the noisy mixture is input into the network and the network is trained to produce an output close to the ideal target. To relax the conditions on the training data, we consider the task of training speech enhancement networks in a self-supervised manner. We first use a limited training set of clean speech sounds and learn a latent representation by autoencoding on their magnitude spectrograms. We then autoencode on speech mixtures recorded in noisy environments and train the resulting autoencoder to share a latent representation with the clean examples. We show that using this training schema, we can now map noisy speech to its clean version using a network that is autonomously trainable without requiring labeled training examples or human intervention.
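The two-stage schema described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the autoencoders are linear, the "spectrograms" are random non-negative stand-in arrays, and the latent-sharing penalty (a cycle through the frozen clean decoder and encoder) is one hypothetical way to tie the two latent spaces together, since the abstract does not specify the mechanism. All function names, the weight `lam`, and the composition used in `enhance` (noisy encoder followed by clean decoder) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(d_in, d_out):
    # Small random weights for a linear layer (illustrative choice).
    return rng.normal(scale=0.1, size=(d_in, d_out))

def train_clean_autoencoder(X, k, steps=400, lr=5e-3):
    """Stage 1: autoencode clean magnitude frames X (n_frames x n_bins)."""
    n, d = X.shape
    enc, dec = init(d, k), init(k, d)
    for _ in range(steps):
        Z = X @ enc                  # latent codes
        R = Z @ dec - X              # reconstruction residual
        g_dec = Z.T @ R / n
        g_enc = X.T @ (R @ dec.T) / n
        enc -= lr * g_enc
        dec -= lr * g_dec
    return enc, dec

def train_noisy_autoencoder(Y, clean_enc, clean_dec, steps=400, lr=5e-3, lam=0.1):
    """Stage 2: autoencode noisy frames Y while pulling their latent codes
    toward the clean latent space. The penalty used here is an ASSUMPTION:
    a code should be unchanged after passing through the frozen clean
    decoder and clean encoder."""
    n, d = Y.shape
    k = clean_enc.shape[1]
    enc, dec = init(d, k), init(k, d)
    cycle = clean_dec @ clean_enc - np.eye(k)   # fixed (k x k) map
    for _ in range(steps):
        Z = Y @ enc
        R = Z @ dec - Y              # noisy reconstruction residual
        C = Z @ cycle                # latent cycle-consistency residual
        g_Z = (R @ dec.T + lam * C @ cycle.T) / n
        g_dec = Z.T @ R / n
        enc -= lr * (Y.T @ g_Z)
        dec -= lr * g_dec
    return enc, dec

def enhance(Y, noisy_enc, clean_dec):
    # Assumed inference path: noisy encoder, then clean decoder.
    # Clipping keeps the output non-negative like a magnitude spectrogram.
    return np.maximum(Y @ noisy_enc @ clean_dec, 0.0)

def mse(A, B):
    return float(np.mean((A - B) ** 2))

# Stand-in data: unpaired "clean" and "noisy" magnitude frames (not real audio).
clean = np.abs(rng.normal(size=(200, 16)))
noisy = np.abs(rng.normal(size=(200, 16))) + 0.3 * np.abs(rng.normal(size=(200, 16)))

clean_enc, clean_dec = train_clean_autoencoder(clean, k=4)
noisy_enc, noisy_dec = train_noisy_autoencoder(noisy, clean_enc, clean_dec)

clean_recon_mse = mse(clean @ clean_enc @ clean_dec, clean)
noisy_recon_mse = mse(noisy @ noisy_enc @ noisy_dec, noisy)
enhanced = enhance(noisy[:5], noisy_enc, clean_dec)
```

Note that no noisy-clean pairs appear anywhere in the sketch: the clean and noisy sets are drawn independently, and the only link between the two autoencoders is the latent penalty. A practical system would replace the linear maps with neural networks and the random arrays with magnitude spectrograms computed from recordings.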
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
