MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo,, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu

TL;DR
MaskGCT is a novel non-autoregressive zero-shot TTS model that generates speech from text without explicit alignment or duration prediction, achieving superior quality and naturalness in large-scale experiments.
Contribution
Introduces MaskGCT, a two-stage, mask-and-predict non-autoregressive TTS model that eliminates the need for explicit alignment and duration modeling, improving zero-shot speech synthesis.
Findings
Outperforms state-of-the-art zero-shot TTS systems in quality and naturalness
Demonstrates robustness and high intelligibility on 100K hours of in-the-wild speech
Operates efficiently with parallel token generation during inference
Abstract
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Speech Recognition and Synthesis
MethodsAttention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Linear Layer · Adam · Dropout · Layer Normalization · Dense Connections
