Can We Trust Deep Speech Prior?
Ying Shi, Haolin Chen, Zhiyuan Tang, Lantian Li, Dong Wang, Jiqing Han

TL;DR
This paper critically examines the use of deep speech priors in speech enhancement, showing that while they can achieve reasonable performance, their effectiveness is limited by a mismatch between model flexibility and maximum-likelihood training.
Contribution
It provides a comprehensive analysis of deep speech priors, highlighting potential pitfalls and the mismatch between model flexibility and maximum-likelihood training.
Findings
Deep speech priors can achieve reasonable SE performance.
Results may be suboptimal because of the disharmony between model flexibility and maximum-likelihood training.
Analysis reveals limitations of deep generative models in speech enhancement.
Abstract
Recently, speech enhancement (SE) based on deep speech priors, such as the variational auto-encoder with non-negative matrix factorization (VAE-NMF) architecture, has attracted much attention. Compared to conventional approaches that represent clean speech by shallow models such as Gaussians with a low-rank covariance, the new approach employs deep generative models to represent the clean speech, which often provides a better prior. Despite the clear advantage in theory, we argue that deep priors must be used with much caution, since the likelihood produced by a deep generative model does not always coincide with the speech quality. We designed a comprehensive study on this issue and demonstrated that based on deep speech priors, a reasonable SE performance can be achieved, but the results might be suboptimal. A careful analysis showed that this problem is deeply rooted in the disharmony…
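The abstract's warning that likelihood does not always coincide with speech quality can be illustrated with a toy sketch. This is not the paper's experiment and it does not use a deep model. It assumes the shallow prior the abstract mentions, a zero-mean Gaussian with a low-rank covariance, plus a small diagonal term so the covariance is invertible. The function name, the dimensions, and the noise variance are all hypothetical. Under such a prior, an all-zero "silent" frame always scores a higher likelihood than a frame actually drawn from the prior, even though silence is not good speech.

```python
import numpy as np


def lowrank_gaussian_logpdf(x, W, noise_var):
    """Log-density of N(0, W W^T + noise_var * I) evaluated at x.

    W and noise_var are illustrative parameters, not values from the paper.
    """
    d = x.shape[0]
    cov = W @ W.T + noise_var * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


rng = np.random.default_rng(0)
d, rank = 16, 3          # hypothetical frame dimension and covariance rank
noise_var = 0.1          # hypothetical diagonal variance
W = rng.normal(size=(d, rank))

# A frame drawn from the prior itself, standing in for "typical" content.
z = rng.normal(size=rank)
x_sample = W @ z + np.sqrt(noise_var) * rng.normal(size=d)

# An all-zero frame, standing in for silence.
x_silence = np.zeros(d)

ll_sample = lowrank_gaussian_logpdf(x_sample, W, noise_var)
ll_silence = lowrank_gaussian_logpdf(x_silence, W, noise_var)

print("log-likelihood of a prior sample:", ll_sample)
print("log-likelihood of silence:       ", ll_silence)
```

The gap appears because the mode of a zero-mean Gaussian is the zero vector, so maximizing likelihood alone favors an output with no perceptual value. The paper argues that a related mismatch arises with deep generative priors. This sketch only shows the simplest case of likelihood and quality pulling apart.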
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
