EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

Deok-Hyeon Cho; Hyung-Seok Oh; Seung-Bin Kim; Seong-Whan Lee

arXiv:2411.02625 · cs.SD · April 18, 2025



TL;DR

EmoSphere++ is a novel zero-shot TTS system that controls emotional style and intensity using an emotion-adaptive spherical vector, enabling natural and expressive speech synthesis without extensive manual annotations.

Contribution

The paper introduces an emotion-adaptive spherical vector for controlling emotional style and intensity, and a multi-level style encoder intended to generalize to both seen and unseen speakers in zero-shot TTS.

Findings

01 Effective emotion transfer in zero-shot scenarios

02 High-quality expressive speech synthesis achieved

03 Generalizes well to unseen speakers

Abstract

Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to…
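
The abstract does not give the construction of the spherical vector. As a rough illustration only, assume an emotion is first placed as a point in a 3-D space (for instance arousal, valence, and dominance) and measured from a neutral reference point. That choice of axes, the `center` argument, and the function name `to_spherical` are assumptions made for this sketch, not details taken from the abstract. Under those assumptions, a Cartesian-to-spherical conversion separates a radial distance, which can stand in for intensity, from two angles, which can stand in for style:

```python
import math


def to_spherical(point, center=(0.0, 0.0, 0.0)):
    """Convert a 3-D point to (r, theta, phi) relative to `center`.

    Hypothetical helper for illustration; not the paper's implementation.
    r     : distance from the center (read here as emotion intensity)
    theta : polar angle in [0, pi]
    phi   : azimuth in [-pi, pi] (the two angles read here as emotion style)
    """
    x, y, z = (p - c for p, c in zip(point, center))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # The center itself has no direction; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi
```

In this sketch, scaling `r` while holding the two angles fixed would change intensity without changing style, which is the kind of independent control the abstract describes.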


Code & Models

Repositories

Choddeok/EmoSpherepp
PyTorch · Official


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing