Emotional Prosody Control for Speech Generation

Sarath Sivaprasad; Saiteja Kosgi; Vineet Gandhi

arXiv:2111.04730·eess.AS·November 10, 2021

TL;DR

This paper introduces a novel text-to-speech system that enables fine-grained, continuous emotional control in speech synthesis, allowing for style transfer and emotion manipulation across unseen speakers without degrading speech quality.

Contribution

It extends the FastSpeech2 backbone to a multi-speaker setting and adds continuous emotion control within a single framework, enabling more natural and expressive speech synthesis.

Findings

01

Effective emotion control in a continuous Arousal-Valence space.

02

Successful style transfer to unseen speakers.

03

Maintains high speech quality with emotional variation.

Abstract

Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text-to-speech systems generate speech with a flat emotion, an emotion selected from a predefined set, an average variation learned from prosody sequences in the training data, or a style transferred from a source. We propose a text-to-speech (TTS) system in which a user can choose the emotion of the generated speech from a continuous and meaningful emotion space (the Arousal-Valence space). The proposed TTS system can generate speech from text in any speaker's style, with fine control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a sample of their speech. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective…
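
To make the idea of a continuous emotion space concrete, here is a minimal sketch of how a point in Arousal-Valence space could condition a TTS encoder. It is not the authors' implementation: this summary does not describe the paper's conditioning mechanism, so the linear projection, the additive conditioning, and every name below (`project`, `condition`, `weights`, `speaker_emb`) are assumptions made for illustration. Random weights and zero-valued encoder states stand in for trained parameters and real text encodings.

```python
import random

def project(av, weights, bias):
    """Linearly project a 2-D (arousal, valence) point to the hidden size.

    Hypothetical helper: a trained model would learn `weights` and `bias`.
    """
    arousal, valence = av
    return [w_a * arousal + w_v * valence + b
            for (w_a, w_v), b in zip(weights, bias)]

def condition(encoder_states, speaker_emb, emotion_emb):
    """Add the speaker and emotion vectors to every encoder frame."""
    return [[h + s + e for h, s, e in zip(frame, speaker_emb, emotion_emb)]
            for frame in encoder_states]

hidden = 4
rng = random.Random(0)

# Stand-ins for learned parameters and model inputs (assumptions).
weights = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(hidden)]
bias = [0.0] * hidden
encoder_states = [[0.0] * hidden for _ in range(3)]  # 3 text frames
speaker_emb = [0.1] * hidden                         # one speaker's style vector

# Any real-valued point is a valid input, not only a fixed label set.
calm = condition(encoder_states, speaker_emb,
                 project((-0.5, 0.5), weights, bias))
excited = condition(encoder_states, speaker_emb,
                    project((0.8, 0.6), weights, bias))
```

Because the emotion input is a pair of real numbers rather than a categorical label, intermediate or previously unseen emotions correspond to other points in the same plane, and swapping `speaker_emb` changes the speaker independently of the emotion.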
