Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Vadim Popov; Ivan Vovk; Vladimir Gogoryan; Tasnima Sadekova; Mikhail Kudinov

arXiv:2105.06337 · cs.LG · August 6, 2021 · 43 cites

Open Access · 5 Repos · 3 Models · 1 Dataset · 1 Video

TL;DR

Grad-TTS introduces a diffusion probabilistic model for text-to-speech synthesis that generates mel-spectrograms through a score-based decoder, offering flexible inference and competitive sound quality.

Contribution

It presents a diffusion-based TTS model that uses stochastic differential equations for decoding and Monotonic Alignment Search to align text with mel-spectrogram frames, making inference more flexible while remaining competitive with existing methods.
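Monotonic alignment is commonly computed with a dynamic program over a matrix of per-token, per-frame log-likelihoods. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `monotonic_alignment_search`, the input `log_lik` (shape: text tokens by mel frames), and the toy matrix are all hypothetical. It assumes every frame maps to exactly one token, the token index never decreases, and no token is skipped.

```python
import numpy as np

def monotonic_alignment_search(log_lik):
    """Return, for each frame, the index of the text token it is aligned to.

    log_lik[i, j] is an assumed log-likelihood of frame j under token i.
    The path starts at token 0, ends at the last token, and at each frame
    either stays on the same token or advances by exactly one.
    """
    n_text, n_frames = log_lik.shape
    if n_text > n_frames:
        raise ValueError("need at least one frame per text token")

    # Q[i, j]: best total log-likelihood of a path ending at token i, frame j.
    Q = np.full((n_text, n_frames), -np.inf)
    Q[0, 0] = log_lik[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_lik[i, j]

    # Backtrack from the last token at the last frame.
    alignment = np.zeros(n_frames, dtype=int)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment

# Toy example: 3 tokens, 5 frames, with one clearly best monotonic path.
toy = np.full((3, 5), -5.0)
for token, frame in [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)]:
    toy[token, frame] = 0.0
toy_alignment = monotonic_alignment_search(toy)
print(toy_alignment.tolist())  # [0, 0, 1, 2, 2]
```

The number of frames assigned to each token in such an alignment can serve as a duration target, which is one reason a hard monotonic alignment is convenient for TTS.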

Findings

01. Mean Opinion Scores competitive with state-of-the-art TTS models

02. Explicit control over the trade-off between sound quality and inference speed

03. Effective reconstruction of mel-spectrograms from noise

Abstract

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with a score-based decoder producing mel-spectrograms by gradually transforming noise predicted by the encoder and aligned with the text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows us to make this reconstruction flexible by explicitly controlling the trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with…
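The quality versus speed trade-off mentioned in the abstract can be illustrated with a fixed-step solver for a reverse-time ODE. The sketch below rests on several assumptions that are not stated on this page: a forward SDE of the form dX = 0.5 * beta(t) * (mu - X) dt + sqrt(beta(t)) dW on t in [0, 1], a linear schedule for beta with illustrative endpoints, and a terminal distribution N(mu, I) centred on the encoder output mu. A closed-form score for toy Gaussian data stands in for the trained score network, and all names (`reverse_ode_euler`, `toy_score`, the 80-bin frame size) are hypothetical.

```python
import numpy as np

BETA_0, BETA_1 = 0.05, 20.0  # assumed endpoints of a linear noise schedule

def beta(t):
    return BETA_0 + (BETA_1 - BETA_0) * t

def rho(t):
    # exp(-0.5 * integral of beta from 0 to t) for the linear schedule above
    return np.exp(-0.5 * (BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t ** 2))

def toy_score(x, mu, t, data_mean, data_std):
    """Exact score of the noised marginal when clean data is N(data_mean, data_std^2).

    This replaces the learned score network purely for illustration.
    """
    r = rho(t)
    mean_t = r * data_mean + (1.0 - r) * mu
    var_t = r ** 2 * data_std ** 2 + (1.0 - r ** 2)
    return -(x - mean_t) / var_t

def reverse_ode_euler(x_T, mu, score_fn, n_steps):
    """Integrate dX = 0.5 * beta(t) * (mu - X - score) dt from t=1 down to t=0."""
    x = x_T.copy()
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * h
        drift = 0.5 * beta(t) * (mu - x - score_fn(x, mu, t))
        x = x - h * drift  # one backward Euler step of size h
    return x

rng = np.random.default_rng(0)
mu = np.full(80, 1.0)             # stand-in for an encoder output frame
data_mean, data_std = 1.5, 0.5    # toy "clean data" distribution
z = rng.standard_normal((256, 80))
x_T = mu + z                      # start from noise centred on mu

def score_fn(x, mu, t):
    return toy_score(x, mu, t, data_mean, data_std)

# For this toy problem the exact ODE maps mu + z to roughly data_mean + data_std * z.
target = data_mean + data_std * z
err_coarse = np.abs(reverse_ode_euler(x_T, mu, score_fn, 4) - target).mean()
err_fine = np.abs(reverse_ode_euler(x_T, mu, score_fn, 1000) - target).mean()
print(f"4 steps: {err_coarse:.4f}  1000 steps: {err_fine:.4f}")
```

Each step costs one score evaluation, so `n_steps` is the knob: fewer steps run faster but leave a larger discretization error, more steps do the opposite.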

Code & Models

Datasets

purdueviperlab/diffssd
dataset · 37 dl

Videos

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Opinion Dynamics and Social Influence · Speech and Audio Processing

Methods: Diffusion