Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis

Sho Inoue, Kun Zhou, Shuai Wang, and Haizhou Li

arXiv:2507.04598 · cs.SD · July 8, 2025

TL;DR

This paper presents a hierarchical emotion distribution prediction framework for text-to-speech synthesis, enabling multi-level emotional control and improving expressiveness by capturing speech emotion structure at different granularities.

Contribution

It introduces a novel multi-step hierarchical emotion distribution prediction module that refines local emotional variations using global context, enhancing emotional control in TTS systems.

Findings

01. Significantly improves emotional expressiveness in TTS.

02. Enables precise emotion control at the utterance, word, and phoneme levels.

03. Validated through both objective and subjective evaluations.
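
To make "multi-level control" concrete, here is a minimal sketch of what editing an emotion distribution at one granularity could look like. It is not the authors' interface: the dictionary layout, the emotion labels, the [0, 1] intensity range, and the helper name `set_emotion` are all assumptions made for illustration.

```python
from copy import deepcopy

# Hypothetical layout: one intensity dict for the utterance,
# one per word, and one per phoneme (flat list).
example_ed = {
    "utterance": {"happy": 0.6, "sad": 0.1},
    "word": [{"happy": 0.5, "sad": 0.1}, {"happy": 0.7, "sad": 0.0}],
    "phoneme": [{"happy": 0.4, "sad": 0.2}, {"happy": 0.8, "sad": 0.0}],
}


def set_emotion(ed, level, emotion, value, index=None):
    """Return a copy of `ed` with one emotion intensity overridden at one level."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("intensity must lie in [0, 1] (assumed range)")
    edited = deepcopy(ed)  # leave the caller's distribution untouched
    if level == "utterance":
        edited["utterance"][emotion] = value
    elif level in ("word", "phoneme"):
        if index is None:
            raise ValueError("index is required for word- and phoneme-level edits")
        edited[level][index][emotion] = value
    else:
        raise ValueError(f"unknown level: {level}")
    return edited


# Raise the 'happy' intensity of the second word only.
edited_ed = set_emotion(example_ed, "word", "happy", 1.0, index=1)
```

In a real system the edited distribution would be passed to the synthesizer as conditioning; that step is omitted here because the source text does not describe it.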

Abstract

We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech synthesis (TTS). We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refine local emotional variations, thereby capturing the intrinsic hierarchical structure of speech emotion. Our approach is validated through its integration into a variance adaptor and an external module design compatible with various TTS systems. Both objective and subjective evaluations demonstrate that the proposed framework significantly enhances emotional expressiveness and enables precise control of emotion rendering across multiple speech granularities.
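
The coarse-to-fine idea in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's model: each level's ED is assumed to be a vector of per-emotion intensities, and the learned predictor is replaced by a hypothetical fixed-weight blend (`refine`) so that the flow of global context into local predictions is easy to follow.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class HierarchicalED:
    utterance: Vector             # one vector for the whole utterance
    words: List[Vector]           # one vector per word
    phonemes: List[List[Vector]]  # one vector per phoneme, grouped by word


def refine(parent: Vector, local: Vector, weight: float = 0.5) -> Vector:
    """Blend a coarser-level vector with a local estimate.

    Hypothetical stand-in for a learned predictor conditioned on the parent level.
    """
    return [weight * p + (1.0 - weight) * l for p, l in zip(parent, local)]


def predict_hierarchical_ed(
    utt_estimate: Vector,
    word_estimates: List[Vector],
    phoneme_estimates: List[List[Vector]],
    weight: float = 0.5,
) -> HierarchicalED:
    # Step 1: the utterance-level ED provides the global context.
    utterance = list(utt_estimate)
    # Step 2: each word-level ED is refined using the utterance-level ED.
    words = [refine(utterance, w, weight) for w in word_estimates]
    # Step 3: each phoneme-level ED is refined using its parent word's ED.
    phonemes = [
        [refine(word, p, weight) for p in phones]
        for word, phones in zip(words, phoneme_estimates)
    ]
    return HierarchicalED(utterance, words, phonemes)


# One utterance, one word, one phoneme, two emotion dimensions.
ed = predict_hierarchical_ed([1.0, 0.0], [[0.0, 1.0]], [[[1.0, 1.0]]])
```

The point of the three-step ordering is that a phoneme-level prediction never sees only its own local estimate: it inherits context from its word, which in turn inherits context from the utterance.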
