Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

Fengyu Yang; Jian Luan; Yujun Wang

arXiv:2110.09780 · cs.SD · January 31, 2022 · 1 citation

Open Access

TL;DR

This paper introduces SUS-constrained VAE and text encoder aggregation techniques to improve emotion embedding extraction and integration in speech synthesis, resulting in more expressive emotional speech.

Contribution

The paper proposes a novel constraint on the VAE that improves the clustering of emotion embeddings, and a method that aggregates the representations of all encoder layers to improve emotional expression.

Findings

01. Enhanced emotion embedding quality with better cluster cohesion

02. Improved emotional expressiveness in synthesized speech

03. Effective integration of syntactic and semantic information

Abstract

Learning an emotion embedding from reference audio is a straightforward approach to multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding and how to inject it into the TTS acoustic model more effectively remain open questions. In this paper, we propose a new constraint that helps a VAE extract emotion embeddings with better cluster cohesion. In addition, the resulting emotion embedding is used as a query to aggregate the latent representations of all encoder layers via attention. Queries derived from the encoder layers themselves are also found to be helpful. Experiments show that the proposed methods enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech.
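
The aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the scoring function, the tensor shapes, or whether the weights are computed per time step, so the scaled dot-product scoring, the projection matrices `w_query` and `w_key`, the per-time-step weighting, and the function name `aggregate_layers` are all assumptions made for illustration. The sketch covers only the emotion-embedding query, not the queries taken from the encoder layers themselves, and it does not attempt the VAE constraint, which the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def aggregate_layers(layer_outputs, emotion_embedding, w_query, w_key):
    """Attention-weighted sum over encoder layers (hypothetical helper).

    layer_outputs:     (L, T, D) hidden states of L encoder layers
    emotion_embedding: (E,)      utterance-level emotion embedding
    w_query:           (E, A)    assumed projection for the query
    w_key:             (D, A)    assumed projection for the keys
    Returns the aggregated states (T, D) and the layer weights (T, L).
    """
    query = emotion_embedding @ w_query              # (A,)
    keys = layer_outputs @ w_key                     # (L, T, A)
    # Assumption: scaled dot-product score between the query and each key.
    scores = keys @ query / np.sqrt(query.shape[0])  # (L, T)
    # Normalise across layers so each time step gets a distribution over layers.
    weights = softmax(scores, axis=0)                # (L, T)
    aggregated = np.sum(weights[..., None] * layer_outputs, axis=0)  # (T, D)
    return aggregated, weights.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_layers, num_steps, hidden, emo_dim, attn_dim = 3, 5, 8, 4, 6
    layers = rng.normal(size=(num_layers, num_steps, hidden))
    emotion = rng.normal(size=emo_dim)
    w_q = rng.normal(size=(emo_dim, attn_dim))
    w_k = rng.normal(size=(hidden, attn_dim))
    states, layer_weights = aggregate_layers(layers, emotion, w_q, w_k)
    print(states.shape, layer_weights.shape)
```

Because the weights at each time step sum to one, the aggregated state is a convex combination of the per-layer states, so changing the emotion embedding only changes how much each layer contributes.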

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing