UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech

Jiaxuan Liu; Yang Xiang; Han Zhao; Xiangang Li; Yingying Gao; Shilei Zhang; Zhenhua Ling

arXiv:2505.10599·cs.LG·September 26, 2025

Open Access · 3 Reviews

TL;DR

UDDETTS introduces a unified framework that combines discrete and dimensional emotions for controllable, interpretable emotional speech synthesis, leveraging a semi-supervised training strategy across diverse datasets.

Contribution

The paper proposes UDDETTS, a novel model unifying discrete and dimensional emotions using the ADV space and semi-supervised learning for improved emotional TTS.

Findings

01. Achieves linear emotion control in three dimensions.

02. Outperforms existing methods in emotional speech synthesis quality.

03. Supports flexible emotion control using labels or ADV values.
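To make the two control modes in the findings concrete, the following is a minimal sketch of an input type that accepts either a discrete label or an ADV triple. Everything named here is an assumption for illustration and does not come from the paper: the `EmotionControl` and `to_bins` names, the label set, the 1 to 7 value range, and the bin count. The quantizer below is a plain uniform one, used only as a stand-in; the reviews describe the paper's ADV quantizer as using nonlinear, clustering-based binning, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical label set and ADV range, chosen only for illustration.
LABELS = ("neutral", "happy", "sad", "angry")
ADV_MIN, ADV_MAX = 1.0, 7.0


@dataclass(frozen=True)
class EmotionControl:
    """Control input holding a discrete label, an ADV triple, or both."""
    label: Optional[str] = None
    adv: Optional[Tuple[float, float, float]] = None  # (arousal, dominance, valence)

    def __post_init__(self) -> None:
        if self.label is None and self.adv is None:
            raise ValueError("provide a label, ADV values, or both")
        if self.label is not None and self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.adv is not None:
            if len(self.adv) != 3:
                raise ValueError("ADV must have exactly three values")
            if any(not (ADV_MIN <= v <= ADV_MAX) for v in self.adv):
                raise ValueError("ADV values out of the assumed range")


def to_bins(adv: Tuple[float, float, float], num_bins: int = 8) -> Tuple[int, ...]:
    """Uniformly quantize each ADV value into an integer bin index.

    Stand-in only: this is NOT the nonlinear binning attributed to the paper.
    """
    span = ADV_MAX - ADV_MIN
    return tuple(
        min(int((v - ADV_MIN) / span * num_bins), num_bins - 1) for v in adv
    )


if __name__ == "__main__":
    by_label = EmotionControl(label="happy")
    by_adv = EmotionControl(adv=(1.0, 4.0, 7.0))
    print(by_label)
    print(to_bins(by_adv.adv))
```

Keeping both fields optional mirrors a setting in which some training utterances carry only a label and others carry only ADV annotations, which is the situation the semi-supervised strategy is described as addressing.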

Abstract

Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper tackles a critical and timely problem in expressive speech synthesis; continuous and dimensional control is a clear and important direction for the field.
2. The semi-supervised learning strategy is an effective way to extend training to larger-scale datasets in which only part of the data is well labeled.

Weaknesses

1. Although the paper compares against many baselines, the fairness of the comparison is still not clear to me. A more reasonable comparison would add the ADV prediction and control modules to the corresponding frameworks, which would better demonstrate the universality of the paper's contribution.
2. Some details are not very clear. For example, in Table 3, preference scores are given for two systems, but it is uncertain whether the same backbone is used for the corresponding

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper proposes an LLM-based TTS framework to explicitly unify discrete and dimensional emotions, addressing a key limitation in prior work on emotional TTS.
2. Introducing the interpretable ADV space to LLM-based TTS is a meaningful step toward continuous, decoupled emotion control, addressing limitations of discrete-label methods. The nonlinear binning and semi-supervised fusion of annotations effectively tackle data imbalance and sparsity.
3. Evaluations across three tasks use diverse

Weaknesses

1. The novelty is limited. The work is built directly upon the architecture of models like Spark-TTS and CosyVoice. The addition of ADV control seems to be an incremental improvement rather than a novel framework.
2. The core components lack detailed explanation. For the ADV quantizer, the nonlinear binning based on clustering is a potential key innovation, but its derivation and its relationship to solving sparsity and imbalance are unclear in the main text.
3. The semi-supervised strategy for mixing spo

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The work introduces a potentially useful direction for controllable emotional TTS by modeling ADV in LLMs.

Weaknesses

Your proposed UDDETTS does not outperform other approaches in terms of MOS, UTMOS, WER, SS, and STOI. I suggest further improving these metrics through more refined method design. The methodology is not well written. Please define the symbols before using them. I am confused by the method design. The generated speech quality is not good, with unclear pronunciations, which is not common in existing TTS models. I am wondering if including ADV is the reason why the speech intelligence is getting

Taxonomy

Topics: Emotion and Mood Recognition · Mental Health via Writing · Sentiment Analysis and Opinion Mining