Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer
Azam Rabiee, Tae-Ho Kim, Soo-Young Lee

TL;DR
This paper proposes a method to incorporate and adjust the Pleasure-Arousal-Dominance emotional dimensions into an end-to-end neural TTS system, enabling more nuanced and continuous emotional speech synthesis.
Contribution
It empirically identifies a network architecture for injecting three-dimensional PAD values into a Tacotron-based TTS model, and it adjusts the PAD values themselves for speech synthesis.
Findings
A network architecture for injecting 3D PAD values into Tacotron was selected empirically
PAD values are adjusted specifically for the speech synthesis task
Training on continuous PAD dimensions allows synthesis beyond a fixed set of emotion categories
Abstract
Emotion is not limited to discrete categories of happy, sad, angry, fear, disgust, surprise, and so on. Instead, each emotion category is projected into a set of nearly independent dimensions, named pleasure (or valence), arousal, and dominance, known as PAD. The value of each dimension varies from -1 to 1, such that the neutral emotion is in the center with all-zero values. Training an emotional continuous text-to-speech (TTS) synthesizer on the independent dimensions provides the possibility of emotional speech synthesis with unlimited emotion categories. Our end-to-end neural speech synthesizer is based on the well-known Tacotron. Empirically, we have found the optimum network architecture for injecting the 3D PADs. Moreover, the PAD values are adjusted for the speech synthesis purpose.
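The PAD representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The anchor values in ILLUSTRATIVE_ANCHORS, the helper names (PAD, scale, condition), and the choice of concatenating PAD values onto each encoder frame are all assumptions made for illustration; the abstract does not state where or how the PAD values are injected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space; each axis lies in [-1, 1]."""
    pleasure: float
    arousal: float
    dominance: float

    def clipped(self) -> "PAD":
        # Keep every dimension inside the [-1, 1] range given in the abstract.
        def clip(v: float) -> float:
            return max(-1.0, min(1.0, v))
        return PAD(clip(self.pleasure), clip(self.arousal), clip(self.dominance))

    def as_list(self) -> list:
        return [self.pleasure, self.arousal, self.dominance]


# Neutral emotion sits at the origin, as stated in the abstract.
NEUTRAL = PAD(0.0, 0.0, 0.0)

# Hypothetical anchor points, chosen only for illustration (not from the paper).
ILLUSTRATIVE_ANCHORS = {
    "happy": PAD(0.8, 0.5, 0.4),
    "sad": PAD(-0.6, -0.4, -0.3),
    "angry": PAD(-0.5, 0.8, 0.6),
}


def scale(anchor: PAD, intensity: float) -> PAD:
    """Move from neutral toward an anchor: 0 gives neutral, 1 gives the anchor."""
    return PAD(
        anchor.pleasure * intensity,
        anchor.arousal * intensity,
        anchor.dominance * intensity,
    ).clipped()


def condition(encoder_frames: list, pad: PAD) -> list:
    """Append the 3 PAD values to every encoder frame.

    Concatenation is one possible injection scheme, assumed here for illustration.
    """
    return [list(frame) + pad.as_list() for frame in encoder_frames]


if __name__ == "__main__":
    mild_happy = scale(ILLUSTRATIVE_ANCHORS["happy"], 0.5)
    frames = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for encoder outputs
    print(condition(frames, mild_happy))
```

Because the conditioning input is a continuous vector rather than a category label, any point in the cube can be requested at synthesis time, including intensities and combinations that have no category name.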
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
