Hierarchical Control of Emotion Rendering in Speech Synthesis

Sho Inoue; Kun Zhou; Shuai Wang; Haizhou Li

arXiv:2412.12498 · cs.SD · June 24, 2025

TL;DR

This paper introduces a hierarchical control framework for emotional speech synthesis that enables fine-grained emotion intensity management across phoneme, word, and utterance levels, improving expressiveness and naturalness.

Contribution

The paper proposes a flow-matching based TTS framework with a hierarchical emotion distribution (ED) extractor, giving quantifiable control over emotion intensity at the phoneme, word, and utterance levels.
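To make the idea of multi-level emotion intensity concrete, here is a minimal sketch of one way per-phoneme intensity vectors could be summarized at the word and utterance levels and then adjusted by a user. Everything in it is an assumption for illustration: the emotion label set, the function names `hierarchical_ed` and `scale_emotion`, and the use of simple mean pooling are hypothetical stand-ins and are not taken from the paper, whose extractor may work quite differently.

```python
from statistics import fmean

# Assumed label set, for illustration only.
EMOTIONS = ("angry", "happy", "sad", "surprise")


def mean_vector(vectors):
    """Element-wise mean of equal-length intensity vectors."""
    return [fmean(column) for column in zip(*vectors)]


def hierarchical_ed(phoneme_ed, word_spans):
    """Build a three-level emotion summary from phoneme-level intensities.

    phoneme_ed: list of per-phoneme intensity vectors, one value in [0, 1]
                per entry of EMOTIONS.
    word_spans: list of (start, end) phoneme index ranges, end exclusive.
    Mean pooling is an assumption of this sketch, not the paper's method.
    """
    word_ed = [mean_vector(phoneme_ed[start:end]) for start, end in word_spans]
    utterance_ed = mean_vector(phoneme_ed)
    return {"phoneme": phoneme_ed, "word": word_ed, "utterance": utterance_ed}


def scale_emotion(level_ed, emotion, factor):
    """Scale one emotion's intensity at a given level, clamped to [0, 1]."""
    idx = EMOTIONS.index(emotion)
    return [
        [min(1.0, max(0.0, v * factor)) if i == idx else v
         for i, v in enumerate(vector)]
        for vector in level_ed
    ]


# Three phonemes forming two words: phonemes 0-1 and phoneme 2.
phoneme_ed = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]
ed = hierarchical_ed(phoneme_ed, word_spans=[(0, 2), (2, 3)])
louder_words = scale_emotion(ed["word"], "happy", 2.0)
```

In a setup like this, editing `ed["word"]` changes the emotion of individual words, while editing `ed["utterance"]` shifts the overall tone. A synthesis model conditioned on these vectors would be needed to turn the edits into audio.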

Findings

01. Effective emotion intensity control demonstrated through objective metrics.

02. Enhanced speech naturalness and emotional expressiveness confirmed by subjective evaluations.

03. Hierarchical ED embedding captures emotion variance across speech segments.

Abstract

Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a flow-matching based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates…

Code & Models

Repositories

shinshoji01/hed-project-page (Official)

Taxonomy

Topics: Face recognition and analysis