MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Ziyue Jiang; Yi Ren; Ruiqi Li; Shengpeng Ji; Boyang Zhang; Zhenhui Ye; Chen Zhang; Bai Jionghao; Xiaoda Yang; Jialong Zuo; Yu Zhang; Rui Liu; Xiang Yin; Zhou Zhao

arXiv:2502.18924·eess.AS·March 31, 2025

TL;DR

MegaTTS 3 pairs a sparse alignment algorithm with a multi-condition classifier-free guidance strategy to improve the quality and naturalness of zero-shot speech synthesis, adds control over accent intensity, and produces high-quality output in as few as 8 sampling steps.

Contribution

The paper presents a sparse alignment method and guidance strategies that improve zero-shot TTS robustness and naturalness while avoiding the naturalness constraints that forced alignments impose.

Findings

01 Achieves state-of-the-art zero-shot TTS quality.

02 Supports flexible accent intensity control.

03 Generates high-quality speech with only 8 sampling steps.

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise…
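
The multi-condition classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so the code below assumes a commonly used generic form: the model is evaluated with no conditions, with the text condition only, and with both text and speaker prompt, and the two differences are scaled independently. The function name `multi_condition_cfg` and the parameters `w_text` and `w_spk` are hypothetical and are not taken from the paper or its released code.

```python
import numpy as np


def multi_condition_cfg(pred_uncond, pred_text, pred_text_spk, w_text, w_spk):
    """Combine three model outputs using two independent guidance scales.

    Assumed generic form (not confirmed by the source):
        out = uncond
              + w_text * (text_only - uncond)
              + w_spk  * (text_and_speaker - text_only)
    """
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    pred_text = np.asarray(pred_text, dtype=float)
    pred_text_spk = np.asarray(pred_text_spk, dtype=float)

    # Direction contributed by the text condition alone.
    text_direction = pred_text - pred_uncond
    # Additional direction contributed by the speaker prompt on top of text.
    speaker_direction = pred_text_spk - pred_text

    return pred_uncond + w_text * text_direction + w_spk * speaker_direction


# Toy vectors standing in for one denoising step's model outputs.
uncond = np.array([0.0, 0.0])
text_only = np.array([1.0, 2.0])
text_and_speaker = np.array([2.0, 2.0])

guided = multi_condition_cfg(uncond, text_only, text_and_speaker, w_text=2.0, w_spk=0.5)
```

With both scales set to 1 the sketch returns the fully conditioned prediction unchanged; with both set to 0 it returns the unconditional one. Under this assumed form, varying one scale while holding the other fixed is the kind of knob that an accent intensity adjustment could use.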

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Face recognition and analysis

Methods: Diffusion · ADaptive gradient method with the OPTimal convergence rate