Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Guangyan Zhang; Thomas Merritt; Manuel Sam Ribeiro; Biel Tura-Vecino; Kayoko Yanagisawa; Kamil Pokora; Abdelhamid Ezzerg; Sebastian Cygert; Ammar Abbas; Piotr Bilinski; Roberto Barra-Chicote; Daniel Korzekwa; Jaime Lorenzo-Trueba

arXiv:2307.16679·eess.AS·August 1, 2023

Open Access

TL;DR

This paper compares normalizing flows and diffusion models to traditional L1/L2 approaches for prosody and acoustic modelling in text-to-speech, showing that flow-based models perform best for spectrogram prediction and that both flow-based and diffusion models improve prosody prediction.

Contribution

It provides a comparative analysis of flow-based and diffusion models against L1/L2-trained baselines for two TTS tasks: prosody prediction (log-f0 and duration) and mel-spectrogram prediction.

Findings

01

Flow-based models outperform diffusion and L1 models in spectrogram prediction.

02

Diffusion and flow-based prosody models significantly outperform L2-trained prosody models.

03

Both diffusion and flow-based models improve prosody prediction quality.

Abstract

Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained…
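To make the abstract's point about distributional assumptions concrete, here is a minimal sketch (not code from the paper). It relies on a standard fact: minimizing an L2 loss is equivalent to maximizing the likelihood of a Gaussian with fixed variance, and minimizing an L1 loss is equivalent to doing the same for a Laplace distribution with fixed scale. The function names and the toy log-f0 values are hypothetical and chosen only for illustration.

```python
import math

def l2_loss(pred, target):
    # Mean squared error.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def l1_loss(pred, target):
    # Mean absolute error.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def gaussian_nll(pred, target, sigma=1.0):
    # Mean negative log-likelihood of target under N(pred, sigma^2).
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (t - p) ** 2 / (2 * sigma ** 2)
               for p, t in zip(pred, target)) / len(pred)

def laplace_nll(pred, target, b=1.0):
    # Mean negative log-likelihood of target under Laplace(pred, b).
    const = math.log(2 * b)
    return sum(const + abs(t - p) / b
               for p, t in zip(pred, target)) / len(pred)

# Hypothetical per-frame log-f0 targets and predictions (toy values).
target = [5.0, 5.2, 4.9, 5.4]
pred = [5.1, 5.0, 5.0, 5.1]

# With sigma = 1 and b = 1, each NLL differs from the matching loss only by
# a constant that does not depend on the prediction, so both objectives
# share the same minimizer.
gauss_gap = gaussian_nll(pred, target) - 0.5 * l2_loss(pred, target)
laplace_gap = laplace_nll(pred, target) - l1_loss(pred, target)

print(f"Gaussian NLL - 0.5 * L2 = {gauss_gap:.6f}")
print(f"Laplace NLL  - L1       = {laplace_gap:.6f}")
```

The two gaps equal 0.5·ln(2π) and ln 2 regardless of the predictions. In other words, an L1- or L2-trained model implicitly commits to a unimodal, fixed-width distribution around each prediction, which is the assumption that flow-based and diffusion models are intended to relax.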

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Neural Networks and Applications

Methods: Diffusion · Normalizing Flows