EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Deok-Hyeon Cho; Hyung-Seok Oh; Seung-Bin Kim; Sang-Hoon Lee; Seong-Whan Lee

arXiv:2406.07803·cs.SD·November 6, 2024

TL;DR

EmoSphere-TTS introduces a novel spherical emotion vector approach for controllable, nuanced emotional speech synthesis, enabling manipulation of emotional style and intensity without human annotations.

Contribution

The paper presents a spherical emotion vector, derived from arousal, valence, and dominance pseudo-labels rather than human annotation, together with a dual conditional adversarial network for high-quality, controllable emotional TTS.

Findings

01. Effective control of emotional style and intensity in speech synthesis.

02. High-quality expressive speech generated with the proposed method.

03. Model captures complex emotional nuances without human annotations.

Abstract

Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model…
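The Cartesian-spherical transformation mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the official repository: the function names, the choice of `center` as a reference point in arousal-valence-dominance space, and the reading of the radius as intensity and the two angles as style are assumptions made here for illustration.

```python
import math

def to_spherical(avd, center=(0.0, 0.0, 0.0)):
    """Map a Cartesian (arousal, valence, dominance) point to (r, theta, phi).

    `center` is a hypothetical reference point (for example, an average
    neutral utterance); it is an assumption of this sketch.
    """
    x, y, z = (p - c for p, c in zip(avd, center))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # Angles are undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
    phi = math.atan2(y, x)                         # azimuth in (-pi, pi]
    return r, theta, phi

def to_cartesian(r, theta, phi, center=(0.0, 0.0, 0.0)):
    """Inverse of to_spherical, returning the point in the original frame."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return x + center[0], y + center[1], z + center[2]

# Scaling r while keeping the angles fixed moves the point along one
# direction, which is how intensity could be varied under this reading.
r, theta, phi = to_spherical((0.3, -0.4, 0.5), center=(0.1, 0.1, 0.1))
stronger = to_cartesian(1.5 * r, theta, phi, center=(0.1, 0.1, 0.1))
```

Under these assumptions, changing only `r` adjusts how far the emotion sits from the reference point, while changing `theta` and `phi` changes its direction.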

Code & Models

Repositories

Choddeok/EmoSphere-TTS
PyTorch · Official

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis · Text and Document Classification Technologies

Methods: Focus