Emotional Prosody Control for Speech Generation
Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi

TL;DR
This paper introduces a novel text-to-speech system that enables fine-grained, continuous emotional control in speech synthesis, allowing for style transfer and emotion manipulation across unseen speakers without degrading speech quality.
Contribution
It extends FastSpeech2 to support multi-speaker and continuous emotion control in a unified framework, enabling more natural and expressive speech synthesis.
Findings
Effective emotion control in a continuous Arousal-Valence space.
Successful style transfer to unseen speakers.
Maintains high speech quality with emotional variation.
Abstract
Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text-to-speech systems generate speech with either a flat emotion, an emotion selected from a predefined set, average variation learned from prosody sequences in the training data, or a style transferred from a source. We propose a text-to-speech (TTS) system in which a user can choose the emotion of the generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from text in any speaker's style, with fine control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a sample of their speech. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective…
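To make the idea of a continuous emotion space concrete, here is a minimal sketch of how a user-chosen emotion could be represented as a point in the Arousal-Valence plane and moved smoothly between two settings. It is not the paper's implementation: the names `AVPoint`, `interpolate`, and `emotion_path`, the assumed [-1, 1] range of each axis, and the anchor coordinates are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVPoint:
    """A point in the Arousal-Valence plane (assumed range: [-1, 1] per axis)."""
    arousal: float
    valence: float


def clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Keep a value inside the assumed axis range."""
    return max(lo, min(hi, x))


def interpolate(a: AVPoint, b: AVPoint, t: float) -> AVPoint:
    """Linearly blend two emotion points; t=0 gives a, t=1 gives b."""
    t = clamp(t, 0.0, 1.0)
    return AVPoint(
        arousal=clamp(a.arousal + t * (b.arousal - a.arousal)),
        valence=clamp(a.valence + t * (b.valence - a.valence)),
    )


def emotion_path(a: AVPoint, b: AVPoint, steps: int) -> list[AVPoint]:
    """Return evenly spaced emotion points from a to b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [interpolate(a, b, i / (steps - 1)) for i in range(steps)]


# Illustrative anchor coordinates, not values taken from the paper.
CALM = AVPoint(arousal=-0.5, valence=0.5)
EXCITED = AVPoint(arousal=0.75, valence=0.5)

if __name__ == "__main__":
    for point in emotion_path(CALM, EXCITED, steps=3):
        print(point)
```

In a system of the kind the abstract describes, each point on such a path would be passed to the synthesizer as a conditioning input alongside the text and a speaker reference; how that conditioning is wired into the FastSpeech2 backbone is not specified in this summary.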
