Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling
Tiberiu Boros, Stefan Daniel Dumitrescu, Ionut Mironica, Radu Chivereanu

TL;DR
This paper presents an end-to-end speech synthesis system that combines generative adversarial training, explicit phonetic, pitch, and duration modeling, and discrete style tokens to produce expressive speech from raw phonetic input.
Contribution
It introduces a novel generative adversarial training approach for text-to-speech synthesis that incorporates explicit prosody modeling and style tokens for expressive voice matching.
Findings
Effective raw phoneme-to-audio conversion demonstrated
Enhanced expressive speech synthesis with style tokens
Improved naturalness through explicit prosody modeling
Abstract
We describe an end-to-end speech synthesis system that uses generative adversarial training. We train our vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling. We experiment with several pre-trained models for contextualized and decontextualized word embeddings, and we introduce a new method for highly expressive character voice matching, based on discrete style tokens.
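As a rough illustration of what explicit duration and pitch modeling can mean at a vocoder's input, the sketch below expands a phoneme sequence into per-frame conditioning by repeating each phoneme for its predicted number of frames. This is a generic length-regulation step and not necessarily the authors' implementation. The function name `expand_by_duration`, the frame-count durations, and the single pitch value per phoneme are all assumptions made for the example.

```python
from typing import List, Tuple


def expand_by_duration(
    phonemes: List[str],
    durations: List[int],
    pitch: List[float],
) -> List[Tuple[str, float]]:
    """Repeat each phoneme and its pitch value once per frame.

    Hypothetical helper: durations are frame counts and pitch holds one
    value per phoneme. Both are assumptions, not details from the paper.
    """
    if not (len(phonemes) == len(durations) == len(pitch)):
        raise ValueError("phonemes, durations and pitch must have equal length")

    frames: List[Tuple[str, float]] = []
    for phoneme, n_frames, f0 in zip(phonemes, durations, pitch):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # One (phoneme, pitch) pair per output frame.
        frames.extend([(phoneme, f0)] * n_frames)
    return frames


# Two phonemes lasting 2 and 3 frames give 5 frames of conditioning.
frames = expand_by_duration(["h", "i"], [2, 3], [120.0, 180.0])
```

In a real system the per-frame symbols would typically be embedded and passed to the generator as conditioning; that part is omitted here.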
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
