Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

Hieu-Thi Luong; Junichi Yamagishi

arXiv:2010.03717·eess.AS·October 9, 2020

TL;DR

This paper explores using a latent linguistic embedding within the NAUTILUS voice cloning system for cross-lingual text-to-speech (TTS) and voice conversion (VC), generating speech in a target speaker's voice in a language that speaker does not originally speak, without additional steps.

Contribution

It introduces a method leveraging a well-trained English latent linguistic embedding for cross-lingual TTS and VC, demonstrating high speaker similarity and seamless application across languages.

Findings

01 High speaker similarity in cross-lingual VC

02 Seamless cross-lingual TTS without additional steps

03 Variable perceived naturalness across speakers

Abstract

As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally. This type of system is not simply cloning the voice of the target speaker, but essentially creating a new voice that can be considered better than the original under a specific framing. By using a well-trained English latent linguistic embedding to create a cross-lingual TTS and VC system for several German, Finnish, and Mandarin speakers included in the Voice Conversion Challenge 2020, we show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling