VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Ashishkumar Gudmalwar; Nirmesh Shah; Sai Akarsh; Pankaj Wasnik; Rajiv Ratn Shah

arXiv:2406.08076 · eess.AS · June 13, 2024

TL;DR

VECL-TTS is an end-to-end cross-lingual TTS system that jointly controls voice identity and emotional style when transferring them across languages, and uses content and style consistency losses to improve synthesis quality.

Contribution

The paper introduces VECL-TTS, a system that jointly controls voice identity and emotional style in cross-lingual TTS using multilingual speakers and an emotion embedding network, together with content and style consistency losses.

Findings

01. Achieved an average relative improvement of 8.83% over state-of-the-art (SOTA) methods.

02. Transfers voice identity and emotional style across languages.

03. Improves synthesized speech quality with content and style consistency losses.

Abstract

Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative…
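
The abstract names content and style consistency losses but does not give their form here. As a rough illustration only, one common way to build such a term is to compare an embedding of the reference speech with an embedding of the synthesized speech and penalize their cosine distance. The sketch below assumes that form; the function names, the embedding vectors, and the loss weights are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_loss(ref_emb: Sequence[float], syn_emb: Sequence[float]) -> float:
    """Assumed form: cosine distance between the reference embedding and
    the embedding of the synthesized speech (0 when they point the same way)."""
    return 1.0 - cosine_similarity(ref_emb, syn_emb)


def total_loss(
    tts_loss: float,
    content_loss: float,
    style_loss: float,
    w_content: float = 1.0,  # hypothetical weight, not from the paper
    w_style: float = 1.0,    # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of a base TTS loss and the two consistency terms."""
    return tts_loss + w_content * content_loss + w_style * style_loss


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real style embeddings.
    reference_style = [0.2, 0.9, 0.1]
    synthesized_style = [0.25, 0.85, 0.05]
    style_term = consistency_loss(reference_style, synthesized_style)
    print(f"style consistency term: {style_term:.4f}")
    print(f"total loss: {total_loss(1.0, 0.0, style_term):.4f}")
```

Under this assumed form, the same function could serve both terms: a content encoder would supply the embeddings for the content term and an emotion or style encoder those for the style term. Which encoders and distances VECL-TTS actually uses should be checked against the full paper.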

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Phonetics and Phonology Research