Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis
Sho Inoue, Kun Zhou, Shuai Wang, and Haizhou Li

TL;DR
This paper presents a hierarchical emotion distribution prediction framework for text-to-speech synthesis, enabling multi-level emotional control and improving expressiveness by capturing speech emotion structure at different granularities.
Contribution
It introduces a novel multi-step hierarchical emotion distribution prediction module that refines local emotional variations using global context, enhancing emotional control in TTS systems.
Findings
Significantly improves emotional expressiveness in TTS.
Enables quantitative emotion control at the utterance, word, and phoneme levels.
Validated through both objective and subjective evaluations.
Abstract
We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech synthesis (TTS). We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refine local emotional variations, thereby capturing the intrinsic hierarchical structure of speech emotion. Our approach is validated through its integration into a variance adaptor and an external module design compatible with various TTS systems. Both objective and subjective evaluations demonstrate that the proposed framework significantly enhances emotional expressiveness and enables precise control of emotion rendering across multiple speech granularities.
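The multi-step idea in the abstract (predict the coarse level first, then use it to refine the finer levels) can be illustrated with a minimal sketch. This is not the authors' model: the paper's predictor is learned, whereas the sketch uses a fixed sigmoid with an additive context shift. The emotion label set, the `mix` weight, the feature layout, and the names `predict_level` and `predict_hierarchical_ed` are all assumptions made for illustration.

```python
import math

# Hypothetical label set; the paper's actual emotion categories may differ.
EMOTIONS = ("angry", "happy", "sad")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_pool(vectors):
    """Average a list of equal-length feature vectors column by column."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def predict_level(local, context=None, mix=0.5):
    """Toy stand-in for a learned predictor (an assumption, not the paper's network).

    Squashes local evidence into (0, 1) and, when a coarser-level ED is
    given, shifts each intensity toward that context.
    """
    if context is None:
        return [sigmoid(x) for x in local]
    return [sigmoid(x + mix * (c - 0.5)) for x, c in zip(local, context)]


def predict_hierarchical_ed(phoneme_feats, word_lengths, utterance_ed=None):
    """Predict ED at utterance, word, and phoneme level, coarse to fine.

    phoneme_feats: one feature vector per phoneme (one value per emotion).
    word_lengths:  number of phonemes in each word, in order.
    utterance_ed:  optional manual override, standing in for user control.
    """
    if sum(word_lengths) != len(phoneme_feats):
        raise ValueError("word_lengths must cover every phoneme exactly once")

    # Step 1: utterance level, from features pooled over the whole utterance.
    utt = utterance_ed if utterance_ed is not None else predict_level(mean_pool(phoneme_feats))

    words, phonemes = [], []
    start = 0
    for length in word_lengths:
        span = phoneme_feats[start:start + length]
        # Step 2: word level, conditioned on the utterance-level ED.
        word_ed = predict_level(mean_pool(span), context=utt)
        words.append(word_ed)
        # Step 3: phoneme level, conditioned on the enclosing word's ED.
        phonemes.extend(predict_level(p, context=word_ed) for p in span)
        start += length
    return utt, words, phonemes
```

Passing `utterance_ed` by hand, for example raising the "happy" intensity, propagates through the word and phoneme steps. This mirrors, in toy form, how global context can steer local emotional variations.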
