Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis

Sho Inoue, Kun Zhou, Shuai Wang, and Haizhou Li

arXiv:2405.09171 · cs.SD · May 16, 2024

TL;DR

This paper introduces a hierarchical emotion distribution (ED) framework for text-to-speech synthesis that enables quantitative emotion control at the phoneme, word, and utterance levels, validated through objective and subjective evaluations.

Contribution

The paper proposes a hierarchical emotion distribution that captures emotion intensity variations across phonemes, words, and utterances, giving finer emotion control in TTS than a single utterance-level representation.

Findings

01 Effective emotion prediction demonstrated by both objective and subjective metrics.

02 Enhanced emotional expressiveness in synthesized speech with controllable granularity.

03 Hierarchical ED outperforms previous global prosody-based methods.

Abstract

It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in…
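
To make the three levels of granularity concrete, here is a minimal Python sketch of one way a hierarchical ED could be stored and aligned to the phoneme sequence. It is an illustration only, not the authors' implementation: the emotion set, the [0, 1] intensity range, the `HierarchicalED` container, and the helpers `expand_to_phonemes` and `set_word_intensity` are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a fixed emotion inventory; the abstract does not list one.
EMOTIONS = ["angry", "happy", "sad", "surprise"]


@dataclass
class HierarchicalED:
    """Hypothetical container for emotion intensities at three levels."""
    phoneme: List[List[float]]    # one intensity vector per phoneme
    word: List[List[float]]       # one intensity vector per word
    utterance: List[float]        # one intensity vector for the utterance
    phonemes_per_word: List[int]  # number of phonemes in each word


def expand_to_phonemes(ed: HierarchicalED) -> List[List[float]]:
    """Repeat word- and utterance-level vectors so every phoneme gets one
    concatenated [phoneme | word | utterance] feature row."""
    if len(ed.phonemes_per_word) != len(ed.word):
        raise ValueError("need one phoneme count per word")
    if sum(ed.phonemes_per_word) != len(ed.phoneme):
        raise ValueError("phoneme counts do not match phoneme vectors")
    rows: List[List[float]] = []
    idx = 0
    for word_vec, count in zip(ed.word, ed.phonemes_per_word):
        for _ in range(count):
            rows.append(ed.phoneme[idx] + word_vec + ed.utterance)
            idx += 1
    return rows


def set_word_intensity(ed: HierarchicalED, word_index: int,
                       emotion: str, value: float) -> None:
    """Manually override one emotion's intensity for a single word.
    Assumption: intensities are scalars in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    ed.word[word_index][EMOTIONS.index(emotion)] = value
```

In this sketch, run-time control amounts to editing a value at the chosen level (for example, `set_word_intensity(ed, 1, "happy", 0.8)`) before the expanded rows are passed on to whatever model consumes them; how the paper's TTS model actually conditions on the ED is not specified in the abstract.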
