Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations
Yoori Oh, Juheon Lee, Yoseob Han, Kyogu Lee

TL;DR
This paper introduces a semi-supervised learning approach for emotional speech synthesis that enables fine-grained control of emotional intensity by disentangling emotional features, resulting in more natural and controllable speech output.
Contribution
It proposes a novel semi-supervised method to control continuous emotional intensity by disentangling emotional features in speech synthesis models.
Findings
Improved controllability of emotional intensity in speech synthesis.
Enhanced naturalness of generated speech.
Effective disentanglement of emotional features in the embedding space.
Abstract
Recent text-to-speech models can generate natural speech close to what humans produce, but they still have limitations in expressiveness. Existing emotional speech synthesis models achieve controllability by interpolating features with scaling parameters in an emotional latent space. However, the latent space produced by these models makes it difficult to control continuous emotional intensity, because features such as emotion and speaker identity are entangled. In this paper, we propose a novel method to control the continuous intensity of emotions using semi-supervised learning. The model learns emotions of intermediate intensity using pseudo-labels generated from phoneme-level sequences of speech information. An embedding space built from the proposed model satisfies the uniform grid geometry with an emotional basis. The…
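As a rough illustration of the interpolation-with-scaling-parameter idea that the abstract attributes to existing models (not the semi-supervised method proposed in the paper), the sketch below blends a neutral embedding and an emotional embedding with a scalar intensity. The function name `interpolate_emotion`, the 4-dimensional vectors, and the [0, 1] intensity range are assumptions made for this example only; they do not come from the paper.

```python
from typing import List, Sequence


def interpolate_emotion(
    neutral: Sequence[float],
    emotion: Sequence[float],
    intensity: float,
) -> List[float]:
    """Linearly blend a neutral and an emotional embedding.

    Hypothetical helper: intensity 0.0 returns the neutral embedding,
    1.0 returns the full emotional embedding.
    """
    if len(neutral) != len(emotion):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    # Move from the neutral point toward the emotion point by `intensity`.
    return [n + intensity * (e - n) for n, e in zip(neutral, emotion)]


# Toy embeddings (assumed values, for illustration only).
neutral = [0.0, 0.0, 0.0, 0.0]
happy = [1.0, 0.0, 2.0, -1.0]

# Evenly spaced intensity levels between neutral and fully "happy".
levels = [interpolate_emotion(neutral, happy, a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

In an entangled latent space, moving along such a line can also shift speaker characteristics, which is the limitation the abstract points to; evenly spaced steps yield evenly spaced perceived intensity only if the space is well structured along the emotion direction.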
Taxonomy
Topics: Speech Recognition and Synthesis
