METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer

Xinfa Zhu; Yi Lei; Tao Li; Yongmao Zhang; Hongbin Zhou; Heng Lu; Lei Xie

arXiv:2307.15951 · eess.AS · August 1, 2023 · IEEE/ACM Trans. Audio Speech Lang. Process. · 1 citation

TL;DR

METTS is a multilingual emotional TTS system that transfers emotion across speakers and languages by disentangling the speaker timbre, emotion, and language factors in speech, with the aim of reducing foreign accent and strengthening emotional expressiveness.

Contribution

The paper introduces METTS, a new model that enables cross-speaker and cross-lingual emotion transfer in TTS by disentangling speech features and using formant shift and vector quantization.

Findings

01 Effective cross-lingual emotion transfer demonstrated

02 Disentanglement of speaker timbre and emotion achieved

03 Enhanced naturalness and emotion diversity in synthetic speech

Abstract

Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech because cross-speaker, cross-lingual emotion transfer is difficult: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotional expressiveness. This paper proposes the Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle…

Taxonomy

Topics: Speech Recognition and Synthesis