M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

Xiaopeng Wang; Chunyu Qiang; Ruibo Fu; Zhengqi Wen; Xuefei Liu; Yukun Liu; Yuzhe Liang; Kang Yin; Yuankun Xie; Heng Xie; Chenxing Li; Chen Zhang; Changsheng Li

arXiv:2512.04720·cs.SD·December 5, 2025

Open Access

TL;DR

M3-TTS is a non-autoregressive speech synthesis model built on a multi-modal diffusion transformer. It produces high-fidelity, natural-sounding speech with efficient training and stable text-speech alignment, without relying on pseudo-alignments.

Contribution

The paper proposes a multi-modal diffusion transformer (MM-DiT) architecture for zero-shot, high-fidelity TTS that removes the need for pseudo-alignment and improves efficiency and naturalness.

Findings

01. Achieves state-of-the-art NAR TTS performance on the Seed-TTS and AISHELL-3 benchmarks.

02. Achieves the lowest word error rates: 1.36% (English) and 1.31% (Chinese).

03. Maintains competitive naturalness scores.
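
For readers unfamiliar with the metric behind the word error rates above, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this page does not describe the paper's actual evaluation pipeline (ASR model, text normalization, or how Chinese text is tokenized).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokens are split on whitespace, with no text
    normalization. Real evaluations usually normalize case and punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2%}")
```

In a TTS evaluation, the hypothesis is typically the transcript an ASR system produces from the synthesized audio, so a lower value means the speech is more intelligible to that recognizer.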

Abstract

Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text-speech sequences without pseudo-alignment requirements. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-VAE codec that provides 3× training acceleration. Experimental results on Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves…
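
To make the idea of joint layers over variable-length text and speech sequences more concrete, here is a minimal, dependency-free sketch of MM-DiT-style joint attention: each modality gets its own query/key/value projections, the two token sequences are concatenated along the sequence axis, and a single attention operation lets every token attend to both modalities. This is an assumption-laden illustration, not the authors' implementation. The names `joint_attention`, `linear`, and `random_matrix`, the single attention head, the random weights, and the toy dimensions are all hypothetical, and the sketch omits timestep conditioning, normalization, positional encodings, and feed-forward layers.

```python
import math
import random


def linear(tokens, weight):
    """Apply a (dim_in x dim_out) weight matrix to every token vector."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text, speech, w_text, w_speech):
    """Single-head attention over the concatenation of text and speech tokens.

    Each modality uses its own projection weights (hypothetical dicts with
    keys 'q', 'k', 'v'); attention itself is shared across both sequences.
    """
    # Modality-specific projections, then concatenate along the sequence axis.
    q = linear(text, w_text["q"]) + linear(speech, w_speech["q"])
    k = linear(text, w_text["k"]) + linear(speech, w_speech["k"])
    v = linear(text, w_text["v"]) + linear(speech, w_speech["v"])

    scale = 1.0 / math.sqrt(len(q[0]))
    weights = [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
               for qi in q]
    mixed = [[sum(w * vj[d] for w, vj in zip(row, v)) for d in range(len(v[0]))]
             for row in weights]

    # Split the joint sequence back into its text and speech parts.
    n_text = len(text)
    return mixed[:n_text], mixed[n_text:], weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
dim = 4
text_tokens = random_matrix(3, dim, rng)    # 3 text tokens (toy embeddings)
speech_tokens = random_matrix(5, dim, rng)  # 5 speech-latent frames (toy values)
w_text = {name: random_matrix(dim, dim, rng) for name in "qkv"}
w_speech = {name: random_matrix(dim, dim, rng) for name in "qkv"}

text_out, speech_out, attn = joint_attention(
    text_tokens, speech_tokens, w_text, w_speech)
print(len(text_out), len(speech_out), len(attn), len(attn[0]))  # 3 5 8 8
```

Note that the text and speech sequences have different lengths (3 and 5) and no duration model or upsampling is needed to combine them: the attention weights over the joint sequence are where any text-speech correspondence would be learned.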

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders