XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Edresson Casanova; Kelly Davis; Eren Gölge; Görkem Göknar; Iulian Gulea; Logan Hart; Aya Aljafari; Joshua Meyer; Reuben Morais; Samuel Olayemi; Julian Weber

arXiv:2406.04904·eess.AS·June 10, 2024·2 cites

Open Access 1 Repo 5 Models 2 Datasets

TL;DR

This paper introduces XTTS, a publicly available multilingual zero-shot text-to-speech system built on the Tortoise model. It was trained on 16 languages and achieves state-of-the-art results in most of them, with improved voice cloning and faster training and inference.

Contribution

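The paper presents XTTS, a multilingual ZS-TTS model that extends Tortoise with modifications to enable multilingual training, improve voice cloning, and speed up training and inference. Its coverage extends beyond the few high/medium-resource languages supported by earlier multilingual ZS-TTS models.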

Findings

01

Achieved SOTA results in most of the 16 training languages

02

Supports zero-shot voice cloning across multiple languages

03

Enables faster training and inference processes

Abstract

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium-resource languages, which restricts their application in most low/medium-resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.

Code & Models

Repositories

Edresson/ZS-TTS-Evaluation
PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques