Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers

Peter Ochieng

arXiv:2309.09652 · cs.SD · June 12, 2025

Open Access

TL;DR

UDPNet introduces a novel neural architecture that unrolls the diffusion process into network layers for faster, high-quality speech synthesis, addressing early-stage prediction errors and outperforming existing methods.

Contribution

The paper presents UDPNet, a new approach that unrolls diffusion steps into network layers and predicts latent variables, improving speed and quality in speech synthesis.
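To make the choice of learning target concrete, here is a minimal sketch that builds a noisy input and three candidate targets for one layer: the clean signal, the noise, and a latent at the next retained step. It assumes the standard DDPM forward process and a linear beta schedule, neither of which is stated on this page, and the names `cumulative_alphas` and `training_targets` are hypothetical. Treating the latent target as the noisy sample `stride` steps closer to the data is an assumption made for illustration, not the paper's confirmed formulation.

```python
import numpy as np

def cumulative_alphas(betas):
    """Return abar_t = prod_{i<=t} (1 - beta_i), the usual DDPM quantity."""
    return np.cumprod(1.0 - betas)

def training_targets(x0, t, stride, abar, rng):
    """Build a noisy input x_t and three candidate targets for one layer.

    t and stride are 1-indexed step counts. s = t - stride is the step the
    layer is ASSUMED to output; s == 0 means the clean signal x0.
    """
    s = t - stride
    if s < 0:
        raise ValueError("stride must not exceed t")
    a_t = abar[t - 1]
    a_s = abar[s - 1] if s > 0 else 1.0
    # Sample the latent x_s from q(x_s | x0); this collapses to x0 when s == 0.
    x_s = np.sqrt(a_s) * x0 + np.sqrt(1.0 - a_s) * rng.standard_normal(x0.shape)
    # Continue the forward chain from x_s to x_t so both lie on one trajectory.
    ratio = a_t / a_s
    x_t = np.sqrt(ratio) * x_s + np.sqrt(1.0 - ratio) * rng.standard_normal(x0.shape)
    # Effective noise that maps x0 directly to x_t.
    eps = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
    return x_t, {"x0": x0, "eps": eps, "latent": x_s}

# Example: 1000 diffusion steps, a layer that covers 250 of them.
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
abar = cumulative_alphas(betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)           # stand-in for a short waveform frame
x_t, targets = training_targets(x0, t=1000, stride=250, abar=abar, rng=rng)
```

At the noisiest step, `x_t` carries almost no information about `x0`, so regressing `x0` directly is hard. The latent target is only one stride away from the input, which is one plausible reading of why it would reduce early-stage error.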

Findings

01. Outperforms state-of-the-art in speech quality and efficiency

02. Generalizes well to unseen speakers

03. Enables real-time speech synthesis

Abstract

This work introduces UDPNet, a novel architecture designed to accelerate the reverse diffusion process in speech synthesis. Unlike traditional diffusion models that rely on timestep embeddings and shared network parameters, UDPNet unrolls the reverse diffusion process directly into the network architecture, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively refines the noisy input, culminating in a high-fidelity estimation of the original data, \(x_0\). Additionally, we redefine the learning target by predicting latent variables instead of the conventional \(x_0\) or noise \(\epsilon_0\). This shift addresses the common issue of large prediction errors in early denoising stages, effectively reducing speech distortion. Extensive evaluations on single- and multi-speaker datasets demonstrate that UDPNet consistently outperforms…
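The unrolling idea in the abstract can be sketched as follows: a stack of layers, each with its own parameters and no timestep embedding, where layer boundaries fall on equally spaced diffusion steps. This is a toy illustration under stated assumptions, not the UDPNet architecture: the names `layer_timesteps` and `UnrolledDenoiser` are hypothetical, the residual `tanh` layer is a placeholder for whatever block the paper uses, and the weights are random rather than trained.

```python
import numpy as np

def layer_timesteps(num_diffusion_steps, num_layers):
    """Equally spaced step boundaries from T down to 0, one interval per layer."""
    if num_diffusion_steps % num_layers != 0:
        raise ValueError("num_diffusion_steps must be divisible by num_layers")
    stride = num_diffusion_steps // num_layers
    return [num_diffusion_steps - i * stride for i in range(num_layers + 1)]

class UnrolledDenoiser:
    """Toy unrolled reverse process: one parameter set per retained step."""

    def __init__(self, num_layers, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Separate weights per layer: no parameter sharing, no timestep embedding.
        self.weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(num_layers)]
        self.biases = [np.zeros(dim) for _ in range(num_layers)]

    def forward(self, x_T):
        """Refine x_T layer by layer; return the final estimate and all intermediates."""
        x = x_T
        intermediates = []
        for W, b in zip(self.weights, self.biases):
            x = x + np.tanh(x @ W + b)  # placeholder residual refinement step
            intermediates.append(x)
        return x, intermediates

# Example: 1000 diffusion steps unrolled into 4 layers.
steps = layer_timesteps(1000, 4)
model = UnrolledDenoiser(num_layers=4, dim=8)
x0_hat, latents = model.forward(np.random.default_rng(1).standard_normal(8))
```

Under this reading, sampling costs one forward pass through `num_layers` layers instead of one network evaluation per diffusion step, which is where the speedup would come from.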

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing