Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems

Hyun-Wook Yoon, Ohsung Kwon, Hoyeon Lee, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim, and Min-Jae Hwang

arXiv:2206.15067 · cs.SD · July 4, 2022

Open Access

TL;DR

This paper introduces a novel emotional TTS system that leverages GPT-3 to predict emotion attributes directly from text, enabling paragraph-level emotional speech synthesis without auxiliary inputs.

Contribution

It presents a GPT-3-based emotion prediction method that estimates emotion class and strength directly from text, removing the need for manually specified emotion inputs in emotional speech synthesis.

Findings

01. Effective paragraph-level emotional speech generation

02. No need for auxiliary emotion labels or classes

03. Enhanced emotional context understanding

Abstract

This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method. Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system directly estimates emotion-related attributes from the input text. Specifically, we utilize generative pre-trained transformer (GPT)-3 to jointly predict both an emotion class and its strength, which represent the coarse and fine properties of the emotion, respectively. These attributes are then combined in the emotional embedding space and used as conditional features of the TTS model for generating output speech signals. Consequently, the proposed system can produce emotional speech from text alone, without any auxiliary inputs. Furthermore, because GPT-3 can capture emotional context across consecutive sentences, the proposed method…
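
The step in which a predicted emotion class and strength are combined into one conditioning vector can be illustrated with a minimal sketch. Everything named here is hypothetical: the emotion inventory, the embedding size, the one-hot embedding table, and the interpolation rule are assumptions made for illustration, since the abstract does not state how the combination is performed. The GPT-3 predictor itself is left out, and its output is passed in as a plain value.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical emotion inventory and embedding size (not given in the abstract).
EMOTIONS = ["neutral", "happy", "sad", "angry"]
EMBED_DIM = 4

# Toy embedding table: one one-hot vector per class.
# A trained system would learn these vectors instead.
CLASS_EMBEDDINGS: Dict[str, List[float]] = {
    name: [1.0 if i == j else 0.0 for j in range(EMBED_DIM)]
    for i, name in enumerate(EMOTIONS)
}


@dataclass
class EmotionPrediction:
    """Stand-in for the language model's output for one sentence."""
    emotion: str     # coarse attribute: the emotion class
    strength: float  # fine attribute: intensity, assumed to lie in [0, 1]


def combine(pred: EmotionPrediction, neutral: str = "neutral") -> List[float]:
    """Blend the neutral embedding toward the class embedding by strength.

    Linear interpolation is an assumption chosen for this sketch; it is
    not taken from the paper.
    """
    if pred.emotion not in CLASS_EMBEDDINGS:
        raise ValueError(f"unknown emotion class: {pred.emotion!r}")
    s = min(max(pred.strength, 0.0), 1.0)  # clamp strength to [0, 1]
    base = CLASS_EMBEDDINGS[neutral]
    target = CLASS_EMBEDDINGS[pred.emotion]
    return [(1.0 - s) * b + s * t for b, t in zip(base, target)]


if __name__ == "__main__":
    # One conditioning vector per sentence of a paragraph.
    paragraph = [
        EmotionPrediction("happy", 0.5),
        EmotionPrediction("sad", 1.0),
    ]
    for pred in paragraph:
        print(pred.emotion, combine(pred))
```

Under this assumed rule, a strength of 0 yields the neutral embedding and a strength of 1 yields the full class embedding, so strength acts as a continuous control on top of the discrete class.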

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis