EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens

Joonyong Park; Kenichi Nakamura

arXiv:2508.11273 · eess.AS · October 7, 2025

TL;DR

EmoSSLSphere is a new multilingual emotional speech synthesis framework that uses spherical emotion vectors and SSL-derived discrete tokens to improve emotional control, cross-lingual transfer, and speech quality.

Contribution

The paper combines spherical emotion encoding with SSL-based discrete tokens to improve multilingual emotional TTS.
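The summary does not say how the discrete tokens are obtained from the SSL features. One common way to discretize continuous frame-level features is nearest-centroid assignment against a learned codebook, and the sketch below illustrates only that general idea. The codebook values, the frame vectors, and the function names `nearest_token` and `quantize` are hypothetical and are not taken from the paper.

```python
# Illustrative only: nearest-centroid quantization of frame-level features.
# The codebook and frames are made-up 2-D toy values; a real SSL model would
# produce much higher-dimensional features, and the paper's actual
# tokenization procedure is not described in this summary.

def nearest_token(frame, codebook):
    """Return the index of the codebook vector closest to `frame`
    under squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(frames, codebook):
    """Map each continuous frame to a discrete token id."""
    return [nearest_token(frame, codebook) for frame in frames]


# Hypothetical 3-entry codebook and three feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]
tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2]
```

Each frame is replaced by the id of its closest centroid, so a sequence of continuous vectors becomes a sequence of integers that a downstream synthesis model can consume as tokens.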

Findings

1. Significant improvements in speech intelligibility and spectral fidelity.

2. Enhanced emotional expressiveness and naturalness in synthesized speech.

3. Effective cross-lingual emotion transfer demonstrated on English and Japanese corpora.

Abstract

This paper introduces EmoSSLSphere, a novel framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and leveraging SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. We evaluate EmoSSLSphere on English and Japanese corpora, demonstrating significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that our method outperforms baseline models in terms of naturalness and emotional expressiveness, underscoring its potential as a scalable solution for…
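To make the idea of a spherical emotion space concrete, here is a minimal sketch of converting a point in a 3-D emotion space to spherical coordinates and back. The abstract does not specify the underlying dimensions or how the origin is chosen. The sketch assumes three generic axes, a neutral reference point at the origin, the radius as emotion intensity, and the two angles as emotion style. All function names are hypothetical.

```python
import math

# Illustrative only: the axes, the neutral origin, and the reading of
# radius as "intensity" and angles as "style" are assumptions, not details
# confirmed by the abstract.

def to_spherical(x, y, z, origin=(0.0, 0.0, 0.0)):
    """Convert a 3-D point to (r, theta, phi) relative to `origin`.
    theta is the polar angle in [0, pi]; phi is the azimuth in (-pi, pi]."""
    dx, dy, dz = x - origin[0], y - origin[1], z - origin[2]
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    theta = math.acos(max(-1.0, min(1.0, dz / r)))  # clamp against rounding
    phi = math.atan2(dy, dx)
    return r, theta, phi


def to_cartesian(r, theta, phi, origin=(0.0, 0.0, 0.0)):
    """Inverse of `to_spherical`."""
    x = origin[0] + r * math.sin(theta) * math.cos(phi)
    y = origin[1] + r * math.sin(theta) * math.sin(phi)
    z = origin[2] + r * math.cos(theta)
    return x, y, z


def scale_intensity(point, factor, origin=(0.0, 0.0, 0.0)):
    """Scale the radius while keeping both angles fixed."""
    r, theta, phi = to_spherical(*point, origin)
    return to_cartesian(r * factor, theta, phi, origin)


# Double the intensity of a hypothetical emotion point.
stronger = scale_intensity((0.5, 0.0, 0.0), 2.0)
```

Separating radius from angles is what makes this kind of representation convenient for control: under the assumptions above, changing only the radius adjusts how strongly an emotion is expressed, while changing only the angles moves between emotion styles at the same strength.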

Taxonomy

Topics: Speech Recognition and Synthesis