DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation
Roi Benita, Michael Elad, Joseph Keshet

TL;DR
DiffAR introduces an autoregressive diffusion model that directly generates raw speech waveforms, enabling high-quality, natural, and coherent speech synthesis with diverse outputs, surpassing existing neural speech generation methods.
Contribution
This work presents the first end-to-end diffusion probabilistic model for raw speech waveform generation, using autoregressive conditioning on previously generated frames for improved naturalness and diversity.
Findings
Outperforms state-of-the-art neural speech generators in quality
Enables natural local acoustic behaviors like vocal fry
Produces diverse speech outputs due to the stochastic diffusion process
Abstract
Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical…
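The frame-by-frame generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`predict_noise`, `sample_frame`, `generate`), the frame length, overlap size, and noise schedule are all assumptions made for illustration, and the noise predictor is a placeholder rather than a trained network. The sketch uses a standard DDPM-style reverse process per frame and passes the tail of the previous frame as context; how DiffAR actually injects that context is not specified in the text above.

```python
import numpy as np

def predict_noise(x_t, t, context):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would use x_t, the step t, and the context (the tail of
    the previous frame, plus any phoneme/pitch conditioning). This
    placeholder just returns zeros so the sketch runs end to end.
    """
    return np.zeros_like(x_t)

def sample_frame(frame_len, context, betas, rng):
    """Draw one frame with a DDPM-style reverse (ancestral) sampling loop."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(frame_len)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t, context)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step (t == 0).
        noise = rng.standard_normal(frame_len) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def generate(num_frames, frame_len=1000, overlap=200, num_steps=50, seed=0):
    """Generate overlapping frames sequentially and stitch them together.

    Assumption: each new frame is conditioned on the last `overlap`
    samples generated so far, and only its non-overlapping part is
    appended to the waveform.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    waveform = sample_frame(frame_len, None, betas, rng)
    for _ in range(num_frames - 1):
        context = waveform[-overlap:]
        frame = sample_frame(frame_len, context, betas, rng)
        waveform = np.concatenate([waveform, frame[overlap:]])
    return waveform
```

Because each frame depends only on a fixed-size tail of what came before, the loop in `generate` can run for any number of frames, which is the property the abstract describes as synthesizing an unlimited speech duration. The output length is `frame_len + (num_frames - 1) * (frame_len - overlap)` samples.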
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Diffusion
