DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Roi Benita; Michael Elad; Joseph Keshet

arXiv:2310.01381 · cs.SD · March 12, 2024 · 1 citation

TL;DR

DiffAR is an autoregressive diffusion model that generates raw speech waveforms directly, with no intermediate spectrogram or vocoder. It is reported to produce high-quality, natural, temporally coherent, and diverse speech, outperforming existing neural speech generation methods.

Contribution

This work presents the first end-to-end diffusion probabilistic model for raw speech waveform generation, using autoregressive conditioning on previously generated frames to improve naturalness and diversity.

Findings

01 Outperforms state-of-the-art neural speech generators in quality

02 Enables natural local acoustic behaviors such as vocal fry

03 Produces diverse speech outputs owing to the stochastic diffusion process

Abstract

Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical…
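
The overlapping-frame scheme described in the abstract can be illustrated with a toy loop. The sketch below is not the authors' implementation: `toy_sample_frame` is a hypothetical placeholder for a trained reverse-diffusion sampler, the frame and overlap sizes are arbitrary, and pinning the overlap region to the tail of the previous frame is an assumed way of conditioning chosen only to keep the example short. It shows the bookkeeping only: each new frame starts from noise, shares its first samples with the end of the previous frame, and contributes only its non-overlapping part to the output.

```python
import numpy as np


def toy_sample_frame(rng, frame_len, overlap, prefix=None, num_steps=10):
    """Hypothetical stand-in for a learned reverse-diffusion sampler."""
    x = rng.standard_normal(frame_len)  # start from pure Gaussian noise
    for _ in range(num_steps):
        x = 0.9 * x  # placeholder update; a trained network would denoise here
        if prefix is not None:
            # Assumption: condition by pinning the overlap to the previous tail.
            x[:overlap] = prefix
    return x


def generate_waveform(num_frames, frame_len=8, overlap=2, seed=0):
    """Generate frames sequentially, each sharing `overlap` samples with the last."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if not 0 < overlap < frame_len:
        raise ValueError("overlap must satisfy 0 < overlap < frame_len")
    rng = np.random.default_rng(seed)
    pieces, prefix = [], None
    for _ in range(num_frames):
        frame = toy_sample_frame(rng, frame_len, overlap, prefix)
        # The first frame is kept whole; later frames add only their new samples.
        pieces.append(frame if prefix is None else frame[overlap:])
        prefix = frame[-overlap:].copy()  # tail that conditions the next frame
    return np.concatenate(pieces)
```

With these toy settings the output length is `frame_len + (num_frames - 1) * (frame_len - overlap)`, so the loop can run for as many frames as needed, which is the property the abstract refers to as synthesizing an unlimited speech duration. Changing `seed` changes the noise draws and therefore the output, a crude analogue of the diversity attributed to the stochastic diffusion process.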

Code & Models

Repositories

rbenita/diffar (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion