XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber

TL;DR
This paper introduces XTTS, a multilingual zero-shot text-to-speech system trained on 16 languages. It achieves state-of-the-art results in most of them and supports voice cloning with faster training and inference.
Contribution
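The paper presents XTTS, a multilingual zero-shot multi-speaker TTS (ZS-TTS) model that builds on Tortoise. Its modifications enable multilingual training, improve voice cloning, and speed up training and inference, extending coverage to low- and medium-resource languages.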
Findings
Achieved SOTA results in most of the 16 languages it was trained on
Supports zero-shot voice cloning across multiple languages
Enables faster training and inference
Abstract
Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
