GANtron: Emotional Speech Synthesis with Generative Adversarial Networks
Enrique Hortal, Rodrigo Brechard Alarcia

TL;DR
GANtron introduces an emotion-tunable speech synthesis model using GANs and sequence-to-sequence architecture, enabling more natural and expressive speech generation with improved training convergence.
Contribution
This work presents a novel GAN-based text-to-speech model that allows for easy emotion control and introduces a guided attention loss to enhance training convergence.
Findings
The best model generates speech that lies in the same distribution as the training dataset.
Four configurations, differing in their inputs and training strategies, were evaluated.
A guided attention loss is proposed to speed up training convergence.
Abstract
Speech synthesis is used in a wide variety of industries. Nonetheless, synthesized speech often sounds flat or robotic. The state-of-the-art methods that allow for prosody control are cumbersome to use and do not allow easy tuning. To tackle some of these drawbacks, in this work we target the implementation of a text-to-speech model where the inferred speech can be tuned with the desired emotions. To do so, we use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism. We evaluate four configurations that differ in their inputs and training strategies, and show that our best model can generate speech files that lie in the same distribution as the initial training dataset. Additionally, we propose a new strategy to boost training convergence by applying a guided attention loss.
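The abstract does not spell out the guided attention loss, so the following is only a sketch of one widely used formulation (the one introduced for DC-TTS by Tachibana et al.), not necessarily the exact loss used in GANtron. It penalizes attention mass that falls far from the diagonal of the text-to-frame alignment matrix, which nudges the model toward a monotonic alignment early in training. The function names, the list-of-lists representation, and the width parameter g = 0.2 are assumptions made for illustration.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    Entries near the diagonal are close to 0; entries far from it approach 1.
    The default g = 0.2 is an assumed value, not taken from the GANtron paper.
    """
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over all (text position, frame) pairs."""
    n_text = len(attention)
    n_frames = len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(n_text)
        for t in range(n_frames)
    )
    return total / (n_text * n_frames)


# Toy alignments: a perfectly diagonal one and an anti-diagonal one.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for t in range(4)] for n in range(4)]

loss_diagonal = guided_attention_loss(diagonal)
loss_anti_diagonal = guided_attention_loss(anti_diagonal)
```

In this toy case the diagonal alignment incurs no penalty while the anti-diagonal one does, which is the behavior such a loss relies on. In an actual model this term would typically be added to the main training objective with a weighting factor.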
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
