Unsupervised Speech Enhancement with speech recognition embedding and disentanglement losses
Viet Anh Trinh (1), Sebastian Braun (2) ((1) CUNY Graduate Center, (2) Microsoft Research)

TL;DR
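This paper introduces an unsupervised speech enhancement method that combines speech recognition embeddings with disentanglement losses. It targets two problems of supervised systems: the domain mismatch between synthetic training data and real-world recordings, and the trade-off between enhancement quality and speech recognition (ASR) performance.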
Contribution
It proposes a novel unsupervised loss function extending MixIT with recognition embeddings and disentanglement, improving speech enhancement and ASR performance.
Findings
The proposed loss improves speech enhancement over a supervised baseline on the VoxCeleb dataset.
Joint supervised and unsupervised training achieves comparable speech quality and better ASR performance.
Fully unsupervised training alone does not surpass the supervised baseline.
Abstract
Speech enhancement has recently achieved great success with various deep learning methods. However, most conventional speech enhancement systems are trained with supervised methods that impose two significant challenges. First, a majority of training datasets for speech enhancement systems are synthetic. When mixing clean speech and noisy corpora to create the synthetic datasets, domain mismatches occur between synthetic and real-world recordings of noisy speech or audio. Second, there is a trade-off between increasing speech enhancement performance and degrading speech recognition (ASR) performance. Thus, we propose an unsupervised loss function to tackle those two problems. Our function is developed by extending the MixIT loss function with speech recognition embedding and disentanglement loss. Our results show that the proposed function effectively improves the speech enhancement…
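For readers unfamiliar with the MixIT loss that the abstract builds on, the following is a minimal sketch of the base idea only: two mixtures are summed, a model estimates several sources from the sum, and the loss searches for the assignment of estimated sources back to the two mixtures that reconstructs them best. The function name `mixit_loss`, the array shapes, and the use of mean squared error as the reconstruction term are assumptions made for illustration and are not taken from the paper. The speech recognition embedding and disentanglement terms are not shown, because this summary does not define them.

```python
import itertools

import numpy as np


def mixit_loss(mixtures, estimates):
    """Illustrative MixIT-style loss (hypothetical helper, not the authors' code).

    mixtures:  array of shape (n_mix, T), the reference mixtures.
    estimates: array of shape (n_src, T), sources estimated from sum(mixtures).
    Returns the lowest reconstruction error and the assignment that achieves it.
    """
    n_mix = mixtures.shape[0]
    n_src = estimates.shape[0]
    best_loss, best_assign = np.inf, None
    # Each estimated source is assigned to exactly one mixture, so every
    # column of the binary mixing matrix contains a single 1.
    for assign in itertools.product(range(n_mix), repeat=n_src):
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        remixed = mixing @ estimates
        # Assumption: plain MSE stands in for the reconstruction loss.
        loss = float(np.mean((mixtures - remixed) ** 2))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy usage: three random "sources", with sources 0 and 2 in the first mixture.
rng = np.random.default_rng(0)
sources = rng.standard_normal((3, 16))
mixtures = np.stack([sources[0] + sources[2], sources[1]])
loss, assignment = mixit_loss(mixtures, sources)
```

The exhaustive search over assignments grows exponentially with the number of estimated sources, which is acceptable for the small source counts typical of enhancement models. No clean reference signal appears anywhere in the loss, which is what makes training on real noisy recordings possible.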
