UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech
Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Yingying Gao, Shilei Zhang, Zhenhua Ling

TL;DR
UDDETTS introduces a unified framework that combines discrete and dimensional emotions for controllable, interpretable emotional speech synthesis, leveraging a semi-supervised training strategy across diverse datasets.
Contribution
The paper proposes UDDETTS, a novel model unifying discrete and dimensional emotions using the ADV space and semi-supervised learning for improved emotional TTS.
Findings
Achieves linear emotion control in three dimensions.
Outperforms existing methods in emotional speech synthesis quality.
Supports flexible emotion control using labels or ADV values.
Abstract
Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion…
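To make the idea of discretizing the ADV space concrete, here is a minimal sketch of nonlinear binning, the step the peer reviews credit with handling data imbalance and sparsity. The reviews describe the paper's binning as clustering-based; this sketch swaps in plain quantile binning as a simpler stand-in. The function names, the 1-to-7 annotation scale, and the sample values are all assumptions made for illustration, not details from the paper.

```python
from bisect import bisect_right

def quantile_edges(values, n_bins):
    """Return n_bins - 1 interior edges so each bin holds roughly equal counts.

    Stand-in for the clustering-based binning mentioned in the reviews.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def quantize(value, edges):
    """Map a continuous ADV value to a discrete bin index (a token id)."""
    return bisect_right(edges, value)

# Hypothetical arousal annotations, heavily concentrated near the neutral midpoint.
arousal = [1.0, 3.5, 3.8, 4.0, 4.0, 4.1, 4.2, 4.3, 4.5, 4.6, 5.0, 7.0]

edges = quantile_edges(arousal, n_bins=4)
tokens = [quantize(v, edges) for v in arousal]
```

Uniform-width bins would put most of these samples into one or two bins near the midpoint. Quantile edges are narrow where annotations are dense and wide where they are sparse, so each token is seen about equally often in training.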
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper tackles a critical and timely problem in expressive speech synthesis; continuous, dimensional control is a clear and important direction for the field. 2. The semi-supervised learning strategy is an effective way to extend training to larger-scale datasets when only part of the data is well labeled.
1. Although this article compares many different baselines, the reasonableness of the comparison is still not clear to me. A more reasonable comparison would be to add the ADV prediction and control modules to the corresponding frameworks, which would better illustrate the universality of the article's contribution. 2. Some details are not very clear. For example, in Table 3, preference scores are given for two systems, but it is uncertain whether the same backbone is used for the corresponding
1. The paper proposes an LLM-based TTS framework to explicitly unify discrete and dimensional emotions, addressing a key limitation in prior work on emotional TTS. 2. Introducing the interpretable ADV space to LLM-based TTS is a meaningful step toward continuous, decoupled emotion control, addressing limitations of discrete-label methods. The nonlinear binning and semi-supervised fusion of annotations effectively tackle data imbalance and sparsity. 3. Evaluations across three tasks use diverse
1. The novelty is limited. The work is built directly upon the architecture of models like Spark-TTS and CosyVoice. The addition of ADV control seems to be an incremental improvement rather than a novel framework. 2. The core components lack detailed explanation. For the ADV quantizer, the nonlinear binning based on clustering is a potential key innovation, but its derivation and its relationship to solving sparsity and imbalance are unclear in the main text. 3. The semi-supervised strategy for mixing spo
The work introduces a potentially useful direction for controllable emotional TTS by modeling ADV in LLMs.
Your proposed UDDETTS does not outperform other approaches in terms of MOS, UTMOS, WER, SS, and STOI. I suggest further improving these metrics through more refined method design. The methodology is not well written. Please define the symbols before using them. I am confused by the method design. The generated speech quality is not good, with unclear pronunciations, which is not common in existing TTS models. I am wondering if including ADV is the reason why the speech intelligibility is getting
Taxonomy
Topics: Emotion and Mood Recognition · Mental Health via Writing · Sentiment Analysis and Opinion Mining
