Enhancing audio quality for expressive Neural Text-to-Speech

Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

arXiv:2108.06270 · eess.AS · August 16, 2021

Open Access

TL;DR

This paper presents techniques that improve the signal quality of highly expressive neural TTS voices without additional data, closing 39% of the perceived-naturalness gap (measured with MUSHRA) between the baseline system and human recordings.

Contribution

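It combines three techniques to improve the signal quality of a highly expressive neural TTS voice: tuning the granularity of the autoregressive loop during training, using Generative Adversarial Networks in acoustic modelling, and using Variational Auto-Encoders in both the acoustic model and the neural vocoder.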

Findings

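01  The gap in MUSHRA naturalness scores between the baseline system and recordings is closed by 39% for an expressive voice

02  The techniques require no additional data

03  Combined, the techniques substantially narrow the naturalness gap to human recordings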
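
One way to read the 39% figure is as a relative gap closure: the share of the distance between the baseline's score and the recordings' score that the improved system recovers. The sketch below illustrates that arithmetic only. The function name `gap_closed` and all scores are hypothetical, chosen so the result comes out near 39%; they are not values reported in the paper, and the paper may compute the figure differently.

```python
def gap_closed(baseline: float, proposed: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap recovered by the proposed system.

    Hypothetical helper for illustration; not taken from the paper.
    """
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference score must exceed the baseline score")
    return (proposed - baseline) / gap


# Made-up mean MUSHRA scores on the 0-100 scale (NOT the paper's numbers).
baseline_score = 60.0     # baseline TTS system
proposed_score = 67.8     # system with the combined techniques
recordings_score = 80.0   # human recordings (upper reference)

ratio = gap_closed(baseline_score, proposed_score, recordings_score)
print(f"{ratio:.0%} of the naturalness gap closed")
```

Under this reading, a 39% gap closure is not the same as a 39% rise in the raw score: in the made-up example the score itself rises by only 7.8 points.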

Abstract

Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and the use of Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling