Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers
Peter Ochieng

TL;DR
UDPNet introduces a novel neural architecture that unrolls the diffusion process into network layers for faster, high-quality speech synthesis, addressing early-stage prediction errors and outperforming existing methods.
Contribution
The paper presents UDPNet, a new approach that unrolls diffusion steps into network layers and predicts latent variables, improving speed and quality in speech synthesis.
Findings
Outperforms state-of-the-art methods in speech quality and efficiency
Generalizes well to unseen speakers
Enables real-time speech synthesis
Abstract
This work introduces UDPNet, a novel architecture designed to accelerate the reverse diffusion process in speech synthesis. Unlike traditional diffusion models that rely on timestep embeddings and shared network parameters, UDPNet unrolls the reverse diffusion process directly into the network architecture, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively refines the noisy input, culminating in a high-fidelity estimation of the original data, \(x_0\). Additionally, we redefine the learning target by predicting latent variables instead of the conventional \(x_0\) or noise \(\epsilon_0\). This shift addresses the common issue of large prediction errors in early denoising stages, effectively reducing speech distortion. Extensive evaluations on single- and multi-speaker datasets demonstrate that UDPNet consistently outperforms…
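The unrolling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `layer_timesteps` and `UnrolledDenoiser`, the tanh residual update, and the random weights are all hypothetical placeholders. The sketch only shows the structural point that each layer owns its parameters and stands for one of a set of equally spaced steps in the schedule, so no timestep embedding or shared network is required.

```python
import math
import random


def layer_timesteps(total_steps, num_layers):
    """Map each layer to an equally spaced diffusion step, noisiest first.

    Assumption: total_steps is divisible by num_layers.
    """
    if total_steps % num_layers != 0:
        raise ValueError("total_steps must be divisible by num_layers")
    stride = total_steps // num_layers
    return [total_steps - i * stride for i in range(num_layers)]


class UnrolledDenoiser:
    """Toy stack of layers, one per retained diffusion step (hypothetical)."""

    def __init__(self, dim, total_steps, num_layers, seed=0):
        rng = random.Random(seed)
        self.timesteps = layer_timesteps(total_steps, num_layers)
        # One independent weight matrix per layer: parameters are NOT shared
        # across steps, and no timestep embedding is fed to the layers.
        self.weights = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in self.timesteps
        ]

    def forward(self, x_T):
        """Refine a noisy vector layer by layer; return the final estimate
        and every intermediate (latent) estimate."""
        x = list(x_T)
        latents = []
        for W in self.weights:
            # Placeholder residual refinement; a real layer would be a
            # trained network, not a random linear map with tanh.
            x = [
                xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
                for xi, row in zip(x, W)
            ]
            latents.append(x)
        return x, latents


# A 1000-step schedule unrolled into 4 layers keeps steps 1000, 750, 500, 250.
print(layer_timesteps(1000, 4))  # → [1000, 750, 500, 250]

model = UnrolledDenoiser(dim=3, total_steps=1000, num_layers=4)
x0_hat, latents = model.forward([0.5, -0.2, 0.1])
print(len(latents))  # → 4
```

In this sketch the intermediate outputs collected in `latents` stand in for the per-layer latent variables that the abstract describes as the learning target; how the paper actually supervises them is not specified in the text above.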
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
