Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework

Kyungguen Byun; Jason Filos; Erik Visser; Sunkuk Moon

arXiv:2505.15254 · cs.SD · May 22, 2025

TL;DR

Voice-ENHANCE is a diffusion-based speech restoration framework that combines noise suppression with voice conversion to produce studio-level quality speech, even from noisy input.

Contribution

The paper presents a two-stage system that places a speaker-agnostic generative speech restoration (GSR) model in front of a voice conversion (VC) stage, so that the VC stage, which is vulnerable to noise on its own, receives denoised input.

Findings

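01. Achieved objective speech-quality metric scores comparable to state-of-the-art (SOTA) across multiple datasets.

02. Combines noise suppression with voice conversion, using VC as a restoration step when the target speaker is the same as the source speaker.

03. Remains robust in noisy conditions because the GSR front end suppresses noise before the VC stage runs.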

Abstract

We propose a speech enhancement system that combines speaker-agnostic speech restoration with voice conversion (VC) to obtain a studio-level quality speech signal. While voice conversion models are typically used to change speaker characteristics, they can also serve as a means of speech restoration when the target speaker is the same as the source speaker. However, since VC models are vulnerable to noisy conditions, we have included a generative speech restoration (GSR) model at the front end of our proposed system. The GSR model performs noise suppression and restores speech damage incurred during that process without knowledge about the target speaker. The VC stage then uses guidance from clean speaker embeddings to further restore the output speech. By employing this two-stage approach, we have achieved speech quality objective metric scores comparable to state-of-the-art (SOTA)…
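The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the class name `TwoStagePipeline` and the functions `toy_gsr` and `toy_vc` are hypothetical, and the two toy stages are trivial placeholders (clipping and gain scaling) standing in for the actual GSR and diffusion-based VC models, whose internals the abstract does not specify. The one point it is meant to show is the interface: the first stage sees only the noisy signal, while the second stage additionally receives the clean speaker embedding.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]
Embedding = List[float]


@dataclass
class TwoStagePipeline:
    """Hypothetical wrapper: GSR front end followed by a VC stage."""
    gsr: Callable[[Waveform], Waveform]            # speaker-agnostic restoration
    vc: Callable[[Waveform, Embedding], Waveform]  # guided by a clean speaker embedding

    def enhance(self, noisy: Waveform, speaker_embedding: Embedding) -> Waveform:
        # Stage 1: noise suppression and restoration, no speaker information.
        restored = self.gsr(noisy)
        # Stage 2: voice conversion toward the same speaker, using the
        # clean speaker embedding as guidance.
        return self.vc(restored, speaker_embedding)


def toy_gsr(signal: Waveform) -> Waveform:
    # Placeholder only: clips samples to [-1, 1]. A real GSR model would be
    # a trained generative network.
    return [max(-1.0, min(1.0, s)) for s in signal]


def toy_vc(signal: Waveform, speaker_embedding: Embedding) -> Waveform:
    # Placeholder only: uses the first embedding value as a gain. A real VC
    # stage would be a diffusion model conditioned on the embedding.
    gain = speaker_embedding[0]
    return [gain * s for s in signal]


pipeline = TwoStagePipeline(gsr=toy_gsr, vc=toy_vc)
enhanced = pipeline.enhance([2.0, -3.0, 0.5], [0.5])
print(enhanced)
```

Keeping the stages as separate callables mirrors the abstract's point that the GSR model works without knowledge of the target speaker, so either stage could be swapped out independently in this sketch.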

Taxonomy

Topics: Speech Recognition and Synthesis