Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling

Tiberiu Boros; Stefan Daniel Dumitrescu; Ionut Mironica; Radu Chivereanu

arXiv:2310.09636·cs.LG·October 17, 2023·1 cites

TL;DR

This paper presents an end-to-end speech synthesis system trained adversarially. Its vocoder converts raw phonemes to audio using explicit phonetic, pitch, and duration modeling, and the authors introduce discrete style tokens for expressive character voice matching.

Contribution

It introduces a novel generative adversarial training approach for text-to-speech synthesis that incorporates explicit prosody modeling and style tokens for expressive voice matching.

Findings

01 Effective raw phoneme-to-audio conversion demonstrated

02 Enhanced expressive speech synthesis with style tokens

03 Improved naturalness through explicit prosody modeling

Abstract

We describe an end-to-end speech synthesis system that uses generative adversarial training. We train our vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch, and duration modeling. We experiment with several pre-trained models for contextualized and decontextualized word embeddings, and we introduce a new method for highly expressive character voice matching, based on discrete style tokens.
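
To make "explicit pitch and duration modeling" concrete, here is a minimal sketch of one common way such conditioning is prepared: each phoneme is repeated for its number of frames so that a vocoder could consume a frame-level sequence. The abstract does not describe the actual architecture, so everything here is an assumption for illustration: the `Phone` class, the `expand_to_frames` function, the single pitch value per phoneme, and the example values are hypothetical and are not taken from the paper or from TTS-Cube.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phone:
    """Hypothetical phoneme record; not the paper's data structure."""
    symbol: str      # phoneme label, e.g. "AY"
    duration: int    # explicit duration, in acoustic frames
    pitch_hz: float  # one pitch value per phoneme; 0.0 marks unvoiced


def expand_to_frames(phones: List[Phone]) -> List[Tuple[str, float]]:
    """Repeat each phoneme's (symbol, pitch) pair once per frame."""
    frames: List[Tuple[str, float]] = []
    for p in phones:
        if p.duration < 0:
            raise ValueError(f"negative duration for {p.symbol!r}")
        frames.extend([(p.symbol, p.pitch_hz)] * p.duration)
    return frames


# Illustrative values only: an unvoiced "HH" for 3 frames, then "AY" for 5.
utterance = [Phone("HH", 3, 0.0), Phone("AY", 5, 180.0)]
frames = expand_to_frames(utterance)
print(len(frames))  # 8
```

In a real system the symbols would be mapped to embeddings and the pitch would typically vary within a phoneme; the sketch only shows how explicit durations turn a phoneme sequence into a frame-aligned one.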

Code & Models

Repositories

tiberiu44/TTS-Cube
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling