Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Hyungchan Yoon; Seyun Um; Changwhan Kim; Hong-Goo Kang

arXiv:2204.02172 · cs.SD · August 29, 2023

TL;DR

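This paper improves end-to-end lightweight text-to-speech by adding prosody embeddings to the learned latent representations. The embeddings are extracted from mel-spectrograms during training and estimated from text with GANs at inference, which improves speech quality and text-acoustic alignment with fewer parameters.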

Contribution

The paper introduces a method that uses GANs to estimate prosody embeddings from text, improving TTS performance without increasing model complexity.

Findings

01 Outperforms existing models in quality and efficiency

02 Uses GANs for fast, reliable prosody embedding estimation

03 Achieves better text-acoustic alignment

Abstract

To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variances. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). The prosody embeddings have complex distributions due to the dynamic nature of speech, and GANs let us estimate them reliably and quickly. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. Our proposed model surpasses…
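The training/inference split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the embedding size, the linear stand-ins for the networks, the per-frame addition, and the least-squares GAN objective are assumptions made for illustration only, since the abstract does not specify any of them.

```python
EMB_DIM = 4  # hypothetical prosody-embedding size (not from the paper)


def reference_encoder(mel):
    """Stand-in reference encoder: time-average of the first EMB_DIM mel bins."""
    n_frames = len(mel)
    return [sum(frame[d] for frame in mel) / n_frames for d in range(EMB_DIM)]


def prosody_generator(text_hidden, g_weights):
    """Stand-in generator: a linear map from a text vector to an embedding."""
    return [sum(w * h for w, h in zip(row, text_hidden)) for row in g_weights]


def discriminator(emb, d_weights):
    """Stand-in discriminator: a linear real/fake score for one embedding."""
    return sum(w * e for w, e in zip(d_weights, emb))


def lsgan_losses(real_score, fake_score):
    """Least-squares GAN losses (an assumed objective, not confirmed by the source)."""
    d_loss = 0.5 * ((real_score - 1.0) ** 2 + fake_score ** 2)
    g_loss = 0.5 * (fake_score - 1.0) ** 2
    return d_loss, g_loss


def add_prosody(latent, emb):
    """Add the embedding element-wise to every latent frame (assumed granularity)."""
    return [[l + e for l, e in zip(frame, emb)] for frame in latent]


def forward(latent, text_hidden, g_weights, mel=None):
    """With mel (training): use the reference embedding.
    Without mel (inference): use the embedding estimated from text."""
    if mel is not None:
        emb = reference_encoder(mel)
    else:
        emb = prosody_generator(text_hidden, g_weights)
    return add_prosody(latent, emb), emb
```

In an adversarial setup of this kind, the discriminator would score reference embeddings as real and text-estimated embeddings as fake, so that at inference the generator alone can supply an embedding when no mel-spectrogram is available.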

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing