Combining speakers of multiple languages to improve quality of neural voices

Javier Latorre; Charlotte Bailleul; Tuuli Morrill; Alistair Conkie; Yannis Stylianou

arXiv:2108.07737·cs.CL·August 18, 2021

TL;DR

This paper develops a multi-lingual neural TTS system that leverages multiple speakers and languages to enhance speech quality and enable cross-lingual synthesis, especially with limited data per language.

Contribution

It introduces architectures and training procedures for multi-speaker, multi-lingual TTS that improve quality with less data and support cross-lingual synthesis, validated on a large multi-language dataset.

Findings

01

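Fine-tuning the multi-speaker model to a speaker gives significantly better quality than a single-speaker model in most cases, while using less than 40% of that speaker's data.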

02

Cross-lingual synthesis is, on average, within 80% of the Mean Opinion Score of native single-speaker models.

03

Multi-lingual models outperform single-speaker models in low-data scenarios.

Abstract

In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals of a) improving the quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most cases while using less than $40\%$ of the speaker's data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within $80\%$ of native single-speaker models, in terms of Mean Opinion Score.
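
One way to read the "within 80%" figure is as a ratio of Mean Opinion Scores: the cross-lingual MOS divided by the MOS of the native single-speaker model. The sketch below illustrates that reading only. The abstract does not spell out the exact computation, so the function names are hypothetical and the ratings are invented for illustration, not taken from the paper.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings (conventionally on a 1-5 scale)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return mean(ratings)


def relative_mos(cross_lingual_ratings, native_ratings):
    """Cross-lingual MOS expressed as a fraction of the native MOS."""
    return mean_opinion_score(cross_lingual_ratings) / mean_opinion_score(native_ratings)


# Hypothetical listener ratings, invented purely for illustration.
native = [4, 5, 4, 4, 3]         # native single-speaker model
cross_lingual = [3, 4, 3, 3, 3]  # same voice speaking another language

ratio = relative_mos(cross_lingual, native)
print(f"relative MOS: {ratio:.2f}")
```

Under this reading, a ratio of 0.80 or higher would count as "within 80%" of the native single-speaker model.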

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing