YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Edresson Casanova; Julian Weber; Christopher Shulby; Arnaldo Candido Junior; Eren Gölge; Moacir Antonelli Ponti

arXiv:2112.02418 · cs.SD · May 2, 2023 · 30 cites

TL;DR

YourTTS introduces a multilingual, zero-shot multi-speaker TTS and voice conversion system that achieves state-of-the-art results, works with low-resource languages, and can be fine-tuned with minimal data.

Contribution

It extends the VITS model with novel modifications for zero-shot multilingual and multi-speaker TTS, enabling high-quality synthesis with minimal data and in low-resource languages.

Findings

01. Achieved SOTA results in zero-shot multi-speaker TTS.

02. Comparable results to SOTA in zero-shot voice conversion.

03. Effective fine-tuning with less than 1 minute of speech.

Abstract

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling

Methods: Normalizing Flows · Transformer · HiFi-GAN