High Fidelity Speech Regeneration with Application to Speech Enhancement
Adam Polyak, Lior Wolf, Yossi Adi, Ori Kabeli, Yaniv Taigman

TL;DR
This paper introduces a real-time wav-to-wav generative model for speech regeneration that enhances intelligibility by leveraging semi-recognized speech, prosody, and identity features, surpassing recent baselines.
Contribution
It presents a novel speech regeneration approach using a compact representation and auxiliary identity network, improving speech quality beyond traditional enhancement methods.
Findings
Achieves high-fidelity 24 kHz speech regeneration in real time.
Improves intelligibility and quality over recent baselines.
Utilizes a compact speech representation with ASR and identity features.
Abstract
Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and that utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that…
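
As a rough illustration of the "compact speech representation" idea in the abstract, the sketch below joins per-frame ASR features, a per-frame prosody value (pitch), and a single utterance-level identity vector into one conditioning sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name build_conditioning, the feature sizes, and the choice of simple concatenation are all hypothetical, since the abstract does not say how the features are combined.

```python
from typing import List, Sequence


def build_conditioning(
    asr_frames: Sequence[Sequence[float]],
    f0: Sequence[float],
    identity: Sequence[float],
) -> List[List[float]]:
    """Concatenate content, prosody, and identity features frame by frame.

    Hypothetical helper: the identity vector is repeated on every frame so a
    generator could see the same speaker information at each time step.
    """
    if len(asr_frames) != len(f0):
        raise ValueError("ASR features and pitch track must have the same number of frames")
    return [
        list(frame) + [pitch] + list(identity)
        for frame, pitch in zip(asr_frames, f0)
    ]


# Toy inputs with made-up sizes: 2 frames, 3 ASR dims, 2 identity dims.
asr_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
f0 = [120.0, 0.0]          # 0.0 stands in for an unvoiced frame (assumption)
identity = [0.9, -0.9]

cond = build_conditioning(asr_frames, f0, identity)
print(len(cond), len(cond[0]))
```

Each output frame here has 3 + 1 + 2 = 6 values. A real system would feed a sequence like this to a waveform generator, which is outside the scope of the sketch.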
