DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin

Tao Li; Chenxu Hu; Jian Cong; Xinfa Zhu; Jingbei Li; Qiao Tian; Yuping Wang; Lei Xie

arXiv:2309.00883·cs.SD·September 6, 2023·1 cites

Open Access

TL;DR

DiCLET-TTS is a diffusion model-based approach that enhances cross-lingual text-to-speech by transferring emotion and reducing foreign accent issues, using novel disentangling and conditioning techniques.

Contribution

The paper introduces a diffusion model for cross-lingual emotion transfer in TTS, featuring a new emotion disentangling module and a condition-enhanced decoder for improved naturalness.

Findings

1. Outperforms various competitive models in cross-lingual emotion transfer
2. Effectively reduces foreign accent in cross-lingual speech synthesis
3. Enhances emotional expressiveness in synthesized speech

Abstract

While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional…
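The idea of a forward diffusion process whose terminal distribution is a text- and emotion-dependent prior can be illustrated with a small NumPy sketch. It assumes a Grad-TTS-style formulation with a linear noise schedule, which the abstract does not confirm; `toy_prior_encoder`, the embedding tables, and the schedule values `beta_0` and `beta_1` are hypothetical stand-ins, not the paper's components.

```python
import numpy as np


def integrated_beta(t, beta_0=0.05, beta_1=20.0):
    """Integral of an assumed linear noise schedule beta(s) from 0 to t."""
    return beta_0 * t + 0.5 * (beta_1 - beta_0) * t ** 2


def forward_diffuse(x0, mu, t, rng, beta_0=0.05, beta_1=20.0):
    """Sample x_t from a forward process that drifts from x0 toward N(mu, I).

    Returns the sample together with the mean and standard deviation used.
    """
    b = integrated_beta(t, beta_0, beta_1)
    mean = mu + (x0 - mu) * np.exp(-0.5 * b)  # data is pulled toward the prior mean
    std = np.sqrt(1.0 - np.exp(-b))           # noise grows from 0 toward 1
    return mean + std * rng.standard_normal(x0.shape), mean, std


def toy_prior_encoder(text_ids, emotion_id, text_table, emotion_table):
    """Hypothetical prior encoder: per-frame text embedding plus an emotion
    embedding. No speaker input is used, so the prior is speaker-irrelevant."""
    return text_table[text_ids] + emotion_table[emotion_id]


rng = np.random.default_rng(0)
n_mels, n_frames = 4, 6
text_table = rng.standard_normal((10, n_mels))    # 10 toy phoneme embeddings
emotion_table = rng.standard_normal((3, n_mels))  # 3 toy emotion embeddings
text_ids = np.array([1, 4, 4, 2, 7, 0])           # one phoneme id per frame

mu = toy_prior_encoder(text_ids, 2, text_table, emotion_table)
x0 = rng.standard_normal((n_frames, n_mels))      # stand-in for a mel-spectrogram

_, mean_start, std_start = forward_diffuse(x0, mu, 0.0, rng)
_, mean_end, std_end = forward_diffuse(x0, mu, 1.0, rng)
```

At `t = 0` the sample equals the input, and at `t = 1` its distribution is close to a unit-variance Gaussian centered on `mu`, so the reverse process would start from an emotion-conditioned linguistic prior instead of pure noise.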

Taxonomy

Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining