Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

Zhaoyu Liu; Brian Mak

arXiv:1911.11601 · eess.AS · November 27, 2019 · 19 cites

TL;DR

This paper presents a cross-lingual multi-speaker TTS system that synthesizes high-quality English and Mandarin speech for seen and unseen speakers, native or foreign, without relying on parallel corpora. It uses three separately trained components conditioned on speaker, language, and stress/tone embeddings.

Contribution

It introduces a novel multi-component TTS system conditioned on speaker, language, and tone embeddings, enabling high-quality voice cloning for unseen speakers across languages without parallel data.

Findings

01. High naturalness and intelligibility for native/foreign seen/unseen speakers.

02. Good speaker similarity for native and accented speech.

03. WaveNet vocoder trained on Cantonese generalizes well to Mandarin and English.

Abstract

We investigate a novel cross-lingual multi-speaker text-to-speech synthesis approach for generating high-quality native or accented speech for native/foreign seen/unseen speakers in English and Mandarin. The system consists of three separately trained components: an x-vector speaker encoder, a Tacotron-based synthesizer and a WaveNet vocoder. It is conditioned on three kinds of embeddings: (1) speaker embedding, so that the system can be trained with speech from many speakers with little data from each speaker; (2) language embedding with shared phoneme inputs; (3) stress and tone embedding, which improves the naturalness of synthesized speech, especially for a tonal language like Mandarin. By adjusting the various embeddings, MOS results show that our method can generate high-quality natural and intelligible native speech for native/foreign seen/unseen speakers. Intelligibility and naturalness…
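
To make the three-way conditioning in the abstract concrete, here is a minimal sketch of how phoneme, stress/tone, language, and speaker embeddings could be combined into per-step synthesizer inputs. It is an illustration only, not the authors' implementation: the abstract does not say where each embedding is injected, so plain concatenation at the input is an assumption, as are all table sizes, the random lookup tables standing in for learned parameters, and the names `make_table` and `build_conditioned_inputs`. The speaker vector would in practice come from the x-vector speaker encoder; here it is a placeholder list.

```python
import random

# Hypothetical sizes chosen for illustration; none come from the paper.
N_PHONEMES, N_TONES, N_LANGS = 100, 8, 2
D_PHONE, D_TONE, D_LANG, D_SPK = 32, 8, 4, 16


def make_table(rows, dim, seed):
    """Random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


PHONE_TABLE = make_table(N_PHONEMES, D_PHONE, seed=0)  # shared phoneme set
TONE_TABLE = make_table(N_TONES, D_TONE, seed=1)       # stress/tone labels
LANG_TABLE = make_table(N_LANGS, D_LANG, seed=2)       # e.g. 0=English, 1=Mandarin


def build_conditioned_inputs(phone_ids, tone_ids, lang_id, speaker_embedding):
    """Return one input vector per phoneme step.

    Assumption: every embedding is concatenated at the input. The
    utterance-level language and speaker vectors are repeated at each step.
    """
    if len(phone_ids) != len(tone_ids):
        raise ValueError("need exactly one stress/tone label per phoneme")
    if len(speaker_embedding) != D_SPK:
        raise ValueError(f"speaker embedding must have {D_SPK} dimensions")
    lang_vec = LANG_TABLE[lang_id]
    spk_vec = list(speaker_embedding)
    return [
        PHONE_TABLE[p] + TONE_TABLE[t] + lang_vec + spk_vec
        for p, t in zip(phone_ids, tone_ids)
    ]


# Same text and language, two different (placeholder) speaker vectors.
speaker_a = [0.1] * D_SPK
speaker_b = [0.2] * D_SPK
inputs_a = build_conditioned_inputs([3, 7, 7], [1, 0, 4], 0, speaker_a)
inputs_b = build_conditioned_inputs([3, 7, 7], [1, 0, 4], 0, speaker_b)
print(len(inputs_a), len(inputs_a[0]))
```

In this sketch, changing only the speaker vector leaves the phoneme, tone, and language portions of each step untouched, which mirrors the abstract's idea of adjusting individual embeddings to produce native or accented speech in a chosen voice.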

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques