Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
Kun Zhou, You Zhang, Dianwen Ng, Shengkui Zhao, Hao Wang, Bin Ma

TL;DR
This paper introduces a language model-based TTS system that synthesizes speech across a wide range of emotional styles by letting users control three continuous dimensions (pleasure, arousal, and dominance), with improved naturalness and diversity over baseline models.
Contribution
It presents a framework in which an emotional dimension predictor maps categorical emotion labels into a continuous pleasure-arousal-dominance (PAD) space, enabling flexible emotion control while the TTS model itself needs no explicit emotion labels during training.
Findings
Enhanced emotional expressiveness in synthesized speech
Improved naturalness and diversity over baseline models
Effective control of emotional dimensions in speech synthesis
Abstract
Emotional text-to-speech (TTS) systems struggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expressions and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions: pleasure, arousal, and dominance (PAD). To enable this, we train an emotional dimension predictor that maps categorical emotion labels in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explicit emotion labels during training. Objective and subjective evaluations demonstrate that our framework effectively generates more…
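To make the idea of a continuous PAD space concrete, here is a minimal Python sketch. It is not the paper's method: the paper trains an emotional dimension predictor, whereas this sketch uses a fixed lookup table. The names PAD_ANCHORS, clamp_pad, nearest_label, and interpolate are hypothetical, the [-1, 1] range per axis is an assumption, and the coordinates are illustrative placeholders rather than values from the paper.

```python
from math import dist

# Hypothetical anchor points in PAD space (pleasure, arousal, dominance).
# Each axis is assumed to lie in [-1, 1]. The numbers are illustrative
# placeholders, NOT values reported in the paper.
PAD_ANCHORS = {
    "happy": (0.8, 0.5, 0.4),
    "angry": (-0.5, 0.7, 0.6),
    "sad": (-0.6, -0.4, -0.5),
    "neutral": (0.0, 0.0, 0.0),
}


def clamp_pad(p, a, d):
    """Clip each PAD coordinate into the assumed [-1, 1] range."""
    return tuple(max(-1.0, min(1.0, v)) for v in (p, a, d))


def nearest_label(pad):
    """Return the categorical label whose anchor is closest (Euclidean) to pad."""
    return min(PAD_ANCHORS, key=lambda label: dist(PAD_ANCHORS[label], pad))


def interpolate(label_a, label_b, t):
    """Linearly blend two anchors; t=0 gives label_a, t=1 gives label_b."""
    a, b = PAD_ANCHORS[label_a], PAD_ANCHORS[label_b]
    return clamp_pad(*((1 - t) * x + t * y for x, y in zip(a, b)))
```

Under these assumptions, interpolate("neutral", "happy", 0.5) yields a point halfway between the two anchors. This illustrates how continuous control can reach emotional styles that fall between the categorical labels.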
Taxonomy
Topics: Speech and dialogue systems
