EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens
Joonyong Park, Kenichi Nakamura

TL;DR
EmoSSLSphere is a new multilingual emotional speech synthesis framework that uses spherical emotion vectors and SSL-derived discrete tokens to improve emotional control, cross-lingual transfer, and speech quality.
Contribution
EmoSSLSphere combines spherical emotion encoding with SSL-based discrete tokens for multilingual emotional TTS.
Findings
Significant improvements in speech intelligibility and spectral fidelity.
Enhanced emotional expressiveness and naturalness in synthesized speech.
Effective cross-lingual emotion transfer demonstrated on English and Japanese corpora.
Abstract
This paper introduces EmoSSLSphere, a novel framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and leveraging SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. We evaluate EmoSSLSphere on English and Japanese corpora, demonstrating significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that our method outperforms baseline models in terms of naturalness and emotional expressiveness, underscoring its potential as a scalable solution for…
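As a rough illustration of what encoding emotions in a continuous spherical coordinate space can look like, the sketch below converts a 3-D emotion embedding to spherical coordinates and rescales only its radius. This summary does not specify the paper's actual parametrization, so the following are assumptions made for illustration only: the 3-D embedding, the reading of the radius as emotion intensity and of the two angles as emotion type, and the function names `to_spherical`, `to_cartesian`, and `scale_intensity`.

```python
import math

def to_spherical(x, y, z):
    """Convert a 3-D vector to (r, theta, phi).

    Assumption for illustration: r is read as emotion intensity,
    theta (polar) and phi (azimuth) as emotion type.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # The origin has no defined direction; treat it as neutral.
        return 0.0, 0.0, 0.0
    theta = math.acos(z / r)   # polar angle in [0, pi]
    phi = math.atan2(y, x)     # azimuth in (-pi, pi]
    return r, theta, phi

def to_cartesian(r, theta, phi):
    """Inverse of to_spherical."""
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )

def scale_intensity(vec, factor):
    """Scale the radius only, leaving both angles unchanged."""
    r, theta, phi = to_spherical(*vec)
    return to_cartesian(r * factor, theta, phi)

# Hypothetical embedding: same direction, half the intensity.
strong = (0.6, 0.0, 0.8)
mild = scale_intensity(strong, 0.5)
```

Under these assumptions, separating radius from angles is what would make fine-grained control convenient: intensity can be changed without changing the emotion's direction, and two emotions can be blended by interpolating their angles.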
Taxonomy
Topics: Speech Recognition and Synthesis
