Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability

Rui Liu; Berrak Sisman; Haizhou Li

arXiv:2104.01408 · cs.CL · June 15, 2021 · 5 cites

Open Access

TL;DR

This paper introduces i-ETTS, a novel reinforcement learning-based interactive training method for emotional text-to-speech synthesis that significantly improves the perceptual recognizability of intended emotions.

Contribution

The paper presents what the authors describe as the first application of reinforcement learning to ETTS, improving emotion discriminability by having the synthesis model interact with a speech emotion recognition (SER) model.

Findings

01. i-ETTS outperforms state-of-the-art baselines in emotion accuracy
02. The iterative training strategy improves speech emotion style quality
03. Reinforcement learning effectively enhances perceptual emotion recognition

Abstract

Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable as its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms the state-of-the-art baselines by rendering speech with more accurate emotion style. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
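
As a rough illustration of the interaction loop the abstract describes, the sketch below uses a plain REINFORCE update in which a toy "synthesizer" policy picks an emotion style and a toy recognizer's verdict serves as the reward. Everything in it is an assumption made for illustration: the names `toy_ser` and `train`, the four-label emotion set, the 0/1 reward, the moving-average baseline, and the noise level are hypothetical and are not taken from the paper, whose actual models and reward definition are not given on this page.

```python
import math
import random

# Hypothetical label set; the paper's emotion categories are not listed here.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_ser(style_index, rng, noise=0.1):
    """Toy stand-in for an SER model (an assumption, not the paper's model).

    Returns the emotion index it "hears". With probability `noise` it
    returns a uniformly random label, otherwise the rendered style.
    """
    if rng.random() < noise:
        return rng.randrange(len(EMOTIONS))
    return style_index


def train(target, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a categorical policy over emotion styles.

    The policy stands in for the synthesizer's choice of emotion style.
    The reward is 1.0 when the toy recognizer reports the target emotion
    and 0.0 otherwise.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(EMOTIONS)
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(EMOTIONS)), weights=probs)[0]
        recognized = toy_ser(action, rng)
        reward = 1.0 if recognized == target else 0.0
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train(target=EMOTIONS.index("happy"))
    for name, p in zip(EMOTIONS, final_probs):
        print(f"{name}: {p:.3f}")
```

In this toy setting the policy shifts probability mass toward the style that the recognizer labels as the target emotion. A real system would replace the categorical policy with a TTS model and the noisy lookup with a trained SER classifier.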

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing