Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations

Yoori Oh; Juheon Lee; Yoseob Han; Kyogu Lee

arXiv:2211.06160·eess.AS·May 30, 2023

Open Access

TL;DR

This paper introduces a semi-supervised learning approach for emotional speech synthesis that enables fine-grained control of emotional intensity by disentangling emotional features, resulting in more natural and controllable speech output.

Contribution

It proposes a novel semi-supervised method to control continuous emotional intensity by disentangling emotional features in speech synthesis models.

Findings

01. Improved controllability of emotional intensity in speech synthesis.

02. Enhanced naturalness of generated speech.

03. Effective disentanglement of emotional features in the embedding space.

Abstract

Recent text-to-speech models can generate natural speech close to human speech, but they remain limited in expressiveness. Existing emotional speech synthesis models achieve controllability by interpolating features with scaling parameters in an emotional latent space. However, the latent space these models learn makes continuous emotional intensity difficult to control, because features such as emotion and speaker identity are entangled. In this paper, we propose a novel method to control the continuous intensity of emotions using semi-supervised learning. The model learns emotions of intermediate intensity using pseudo-labels generated from phoneme-level sequences of speech information. An embedding space built from the proposed model satisfies the uniform grid geometry with an emotional basis. The…
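
As a rough illustration of the scaling-parameter interpolation that the abstract attributes to existing models, the sketch below linearly blends a neutral embedding with an emotion embedding. It is not the paper's proposed method. The function name `interpolate_emotion`, the scaling parameter `alpha`, and the toy 3-dimensional vectors are all assumptions made up for this example.

```python
from typing import List


def interpolate_emotion(neutral: List[float], emotion: List[float], alpha: float) -> List[float]:
    """Blend two embeddings; alpha=0 returns neutral, alpha=1 returns the full emotion.

    Hypothetical helper for illustration only, not taken from the paper.
    """
    if len(neutral) != len(emotion):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Element-wise linear interpolation between the two points in latent space.
    return [(1.0 - alpha) * n + alpha * e for n, e in zip(neutral, emotion)]


# Toy embeddings with made-up values; real emotion embeddings are learned and much larger.
neutral_vec = [0.0, 0.0, 0.0]
happy_vec = [1.0, 2.0, -1.0]

# A scaling parameter of 0.5 requests a "half intensity" embedding.
half_happy = interpolate_emotion(neutral_vec, happy_vec, 0.5)
print(half_happy)
```

A blend like this changes intensity as intended only if moving along the line between the two embeddings changes emotion and nothing else. When speaker and emotion features are entangled, as the abstract describes, that assumption may not hold, which is the limitation the paper sets out to address.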

Taxonomy

Topics: Speech Recognition and Synthesis