Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

Kun Zhou; You Zhang; Dianwen Ng; Shengkui Zhao; Hao Wang; Bin Ma

arXiv:2409.16681·eess.AS·January 21, 2026

TL;DR

This paper introduces a language model-based TTS system that can generate a wide range of expressive emotions by controlling pleasure, arousal, and dominance dimensions, improving naturalness and diversity.

Contribution

It presents a novel framework that maps categorical emotion labels into a continuous emotional space, enabling flexible emotion control without needing explicit labels during TTS training.

Findings

01. Enhanced emotional expressiveness in synthesized speech
02. Improved naturalness and diversity over baseline models
03. Effective control of emotional dimensions in speech synthesis

Abstract

Emotional text-to-speech (TTS) systems struggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expressions and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions - pleasure, arousal, and dominance (PAD). To enable this, we train an emotional dimension predictor that maps categorical emotion labels in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explicit emotion labels during training. Objective and subjective evaluations demonstrate that our framework effectively generates more…
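
Illustrative sketch (not from the paper): the idea of placing categorical emotion labels in a continuous PAD space, and then steering between them, can be pictured with a small Python example. Everything here is an assumption made for illustration: the `PAD` class, the `ILLUSTRATIVE_ANCHORS` coordinates, the [-1, 1] value range, and the `blend` helper are hypothetical and are not the authors' predictor, values, or API. The paper trains a predictor to obtain PAD values; this sketch simply hard-codes placeholder points to show what continuous control means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space (the [-1, 1] range is assumed)."""
    pleasure: float
    arousal: float
    dominance: float


# Placeholder coordinates for illustration only; these are NOT the paper's values.
ILLUSTRATIVE_ANCHORS = {
    "neutral": PAD(0.0, 0.0, 0.0),
    "happy": PAD(0.8, 0.5, 0.4),
    "sad": PAD(-0.6, -0.4, -0.3),
}


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Restrict a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def blend(start: PAD, end: PAD, weight: float) -> PAD:
    """Linearly interpolate from start (weight=0) to end (weight=1)."""
    w = clamp(weight, 0.0, 1.0)
    return PAD(
        clamp(start.pleasure + w * (end.pleasure - start.pleasure)),
        clamp(start.arousal + w * (end.arousal - start.arousal)),
        clamp(start.dominance + w * (end.dominance - start.dominance)),
    )


if __name__ == "__main__":
    # A point halfway between two labeled categories, which no single
    # categorical label would describe.
    mild_joy = blend(
        ILLUSTRATIVE_ANCHORS["neutral"], ILLUSTRATIVE_ANCHORS["happy"], 0.5
    )
    print(mild_joy)
```

The point of the sketch is only that a continuous representation admits intermediate styles between labeled categories; how the actual system conditions speech generation on PAD values is described in the paper itself.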

Taxonomy

Topics: Speech and dialogue systems