Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework
Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon

TL;DR
This paper introduces Voice-ENHANCE, a diffusion-based speech restoration framework combining noise suppression and voice conversion to achieve high-quality, studio-level speech signals even in noisy conditions.
Contribution
It presents a novel two-stage system integrating generative speech restoration with voice conversion, improving speech quality in noisy environments.
Findings
Achieved speech quality objective metric scores comparable to state-of-the-art (SOTA) across multiple datasets.
Effectively combines noise suppression with voice conversion for speech restoration.
Demonstrated robustness in noisy conditions with high-quality output.
Abstract
We propose a speech enhancement system that combines speaker-agnostic speech restoration with voice conversion (VC) to obtain a studio-level quality speech signal. While voice conversion models are typically used to change speaker characteristics, they can also serve as a means of speech restoration when the target speaker is the same as the source speaker. However, since VC models are vulnerable to noisy conditions, we have included a generative speech restoration (GSR) model at the front end of our proposed system. The GSR model performs noise suppression and restores speech damage incurred during that process without knowledge about the target speaker. The VC stage then uses guidance from clean speaker embeddings to further restore the output speech. By employing this two-stage approach, we have achieved speech quality objective metric scores comparable to state-of-the-art (SOTA)…
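The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `TwoStageRestorer` and the stand-in functions `toy_gsr` and `toy_vc` are hypothetical names, and the toy stages (a noise gate and a gain) only mark where the real diffusion-based GSR and VC models would sit.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]
Embedding = List[float]


@dataclass
class TwoStageRestorer:
    """Hypothetical wrapper: GSR front end followed by a VC stage."""
    gsr: Callable[[Waveform], Waveform]            # stage 1: no speaker information
    vc: Callable[[Waveform, Embedding], Waveform]  # stage 2: guided by speaker embedding

    def restore(self, noisy: Waveform, clean_speaker_embedding: Embedding) -> Waveform:
        # Stage 1: noise suppression and restoration, speaker-agnostic.
        restored = self.gsr(noisy)
        # Stage 2: conversion toward the same speaker, guided by an
        # embedding taken from clean speech.
        return self.vc(restored, clean_speaker_embedding)


def toy_gsr(wave: Waveform) -> Waveform:
    # Stand-in only (assumption): a crude noise gate, not a generative model.
    return [s if abs(s) >= 0.1 else 0.0 for s in wave]


def toy_vc(wave: Waveform, embedding: Embedding) -> Waveform:
    # Stand-in only (assumption): uses the first embedding value as a gain.
    return [s * embedding[0] for s in wave]


restorer = TwoStageRestorer(gsr=toy_gsr, vc=toy_vc)
output = restorer.restore([0.05, 0.5, -0.02, -0.4], [2.0])
```

Keeping the stages as separate callables mirrors the abstract's point that only the second stage receives speaker information; the first operates on the noisy signal alone.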
Taxonomy
Topics: Speech Recognition and Synthesis
