A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement
Jie Zhang, Haoyin Yan, Xiaofei Li

TL;DR
This paper introduces PGUSE, a universal speech enhancement model that combines predictive and generative approaches to improve speech quality and robustness across a range of distortions, addressing limitations of existing methods.
Contribution
The paper proposes a joint predictive-generative model for speech enhancement, integrating diffusion models with direct prediction to outperform state-of-the-art baselines.
Findings
PGUSE outperforms existing methods on multiple datasets.
The fusion of predictive and generative models improves speech quality.
The approach effectively handles severely degraded speech signals.
Abstract
It is promising to design a single model that can suppress various distortions and improve speech quality, i.e., universal speech enhancement (USE). Compared to supervised learning-based predictive methods, diffusion-based generative models have shown greater potential because they can generate speech from degraded inputs whose information is severely damaged. However, they may introduce artifacts in highly adverse conditions, and diffusion models often suffer from a heavy computational burden because inference requires many steps. To jointly leverage the strengths of prediction and generation and overcome their respective weaknesses, in this work we propose a universal speech enhancement model called PGUSE that combines predictive and generative modeling. Our model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Diffusion
