Restoring degraded speech via a modified diffusion model
Jianwei Zhang, Suren Jayasuriya, Visar Berisha

TL;DR
This paper modifies the diffusion-based DiffWave vocoder to restore degraded speech signals, improving perceptual metrics and subjective quality scores across several types of degradation.
Contribution
The paper presents a novel modification to the DiffWave model, replacing its mel-spectrum upsampler with a deep CNN to enhance speech restoration from degraded inputs.
Findings
Improved speech quality on LPC-10 and AMR-NB compressed speech.
Enhanced perceptual metrics and subjective quality scores.
Better performance in out-of-corpus evaluations.
Abstract
There are many deterministic mathematical operations (e.g., compression, clipping, downsampling) that degrade speech quality considerably. In this paper we introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal. DiffWave, a recently published diffusion-based vocoder, has shown state-of-the-art synthesized speech quality and relatively short waveform generation times, with only a small set of parameters. We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech. The model is trained using the original speech waveform, but conditioned on the degraded speech mel-spectrum. After training, only the degraded mel-spectrum is used as input and the model generates an estimate of the original speech. Our…
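The training setup described in the abstract (diffuse the original waveform, condition on the degraded signal) can be illustrated with a minimal sketch. It assumes the standard DDPM-style forward noising step that diffusion vocoders such as DiffWave build on; it is not the authors' code. The function names, the hard-clipping degradation, and the schedule values are hypothetical choices for illustration, and the mel-spectrum extraction and CNN upsampler are omitted: the degraded waveform stands in for the conditioning signal.

```python
import math
import random


def clip_degrade(wave, threshold=0.3):
    """Hard-clip samples to [-threshold, threshold]; one example of a deterministic degradation."""
    return [max(-threshold, min(threshold, s)) for s in wave]


def linear_beta_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linearly spaced noise variances (illustrative values, not taken from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def make_training_example(clean, degraded, alpha_bars, rng):
    """Build one training example.

    The noisy input is derived from the CLEAN waveform:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    The conditioning source is the DEGRADED signal. In the paper the model is
    conditioned on the degraded mel-spectrum; computing it is omitted here.
    """
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    noisy = [a * x + s * e for x, e in zip(clean, eps)]
    return {
        "noisy": noisy,               # network input at diffusion step t
        "step": t,                    # diffusion step index
        "condition_source": degraded, # stand-in for the degraded mel-spectrum
        "target_noise": eps,          # what a noise-prediction network would regress
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # 10 ms of a 440 Hz tone at 16 kHz as a toy "original speech" signal.
    clean = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
    degraded = clip_degrade(clean)
    alpha_bars = cumulative_alphas(linear_beta_schedule())
    example = make_training_example(clean, degraded, alpha_bars, rng)
    print(example["step"], len(example["noisy"]), len(example["condition_source"]))
```

At inference time, under the same assumptions, only the degraded signal would be available: generation starts from Gaussian noise and is denoised step by step while conditioned on the degraded input.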
