MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Yuancheng Wang; Haoyue Zhan; Liwei Liu; Ruihong Zeng; Haotian Guo; Jiachen Zheng; Qiang Zhang; Xueyao Zhang; Shunsi Zhang; Zhizheng Wu

arXiv:2409.00750·cs.SD·October 22, 2024

Open Access · 1 Repo · 6 Models · 2 Datasets

TL;DR

MaskGCT is a novel non-autoregressive zero-shot TTS model that generates speech from text without explicit alignment or duration prediction, achieving superior quality and naturalness in large-scale experiments.

Contribution

Introduces MaskGCT, a two-stage, mask-and-predict non-autoregressive TTS model that eliminates the need for explicit alignment and duration modeling, improving zero-shot speech synthesis.
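
The mask-and-predict idea can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the cosine unmasking schedule, the confidence-based commit rule, the `MASK` sentinel, and the names `toy_predictor` and `mask_predict_decode` are all hypothetical, and a random stand-in replaces the trained transformer. It shows how every masked position is predicted in parallel at each step, with only the most confident predictions kept.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model (assumption): one (token, confidence)
    guess per position, produced for all positions at once."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict_decode(length, vocab_size, steps=8, seed=0):
    """Iteratively unmask a fully masked sequence in a fixed number of steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Predict every position in parallel.
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        n_masked = int(length * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = len(masked) - n_masked
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(mask_predict_decode(length=20, vocab_size=100))
```

Because the number of steps is fixed in advance, the cost of decoding in this sketch depends on `steps` rather than on the sequence length, which is the usual motivation for parallel masked decoding over token-by-token generation.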

Findings

01. Outperforms state-of-the-art zero-shot TTS systems in quality and naturalness

02. Demonstrates robustness and high intelligibility in experiments with 100K hours of in-the-wild speech

03. Generates tokens in parallel during inference, which keeps decoding efficient

Abstract

The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage,…

Code & Models

Repositories

open-mmlab/amphion (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Linear Layer · Adam · Dropout · Layer Normalization · Dense Connections