LLaDA-TTS: Unifying Speech Synthesis and Zero-Shot Editing via Masked Diffusion Modeling

Xiaoyu Fan; Huizhi Xie; Wei Zou; Yunzhang Chen

arXiv:2603.26364·cs.SD·March 30, 2026

TL;DR

LLaDA-TTS replaces the autoregressive LLM in a TTS system with a masked diffusion model that generates speech tokens in a fixed number of parallel steps, enables zero-shot editing, and can be adapted from a pretrained autoregressive checkpoint with minimal fine-tuning.

Contribution

The paper presents a masked diffusion approach to TTS that delivers faster inference and zero-shot editing, and that is obtained by fine-tuning a pretrained AR checkpoint on only 50 hours of data.

Findings

01

Achieves a 2x LLM-stage speedup over the AR baseline, even though LLaDA-TTS runs without the KV cache that the baseline relies on.

02

Matches the CosyVoice 3 baseline on Seed-TTS-Eval (0.98% CER zh, 1.96% WER en at 64 steps) using only 50 hours of fine-tuning data.

03

Enables zero-shot speech editing, including insertion, deletion, and substitution.
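
The page does not say how the editing operations are implemented. One common way to frame such edits with a masked model is as infilling: the span to change is replaced by masked slots, and the model fills them while attending to the untouched tokens on both sides. The sketch below only builds that masked canvas; `MASK` and `build_edit_canvas` are hypothetical names chosen for illustration, not identifiers from the paper or its code.

```python
MASK = -1  # hypothetical sentinel for a position the model must fill in


def build_edit_canvas(tokens, start, end, new_len):
    """Replace tokens[start:end] with `new_len` masked slots.

    Assumed mapping of edit types (illustrative, not from the paper):
      insertion    -> start == end, new_len > 0
      deletion     -> end > start,  new_len == 0
      substitution -> end > start,  new_len > 0
    """
    if not (0 <= start <= end <= len(tokens)) or new_len < 0:
        raise ValueError("invalid edit span or length")
    # Tokens outside the span are kept verbatim as bidirectional context.
    return tokens[:start] + [MASK] * new_len + tokens[end:]


speech_tokens = [11, 12, 13, 14, 15]
substitution = build_edit_canvas(speech_tokens, 1, 3, 4)
insertion = build_edit_canvas(speech_tokens, 2, 2, 2)
deletion = build_edit_canvas(speech_tokens, 1, 3, 0)
```

In this framing the original speech tokens are left unchanged and only the masked slots would be regenerated, which is why no per-speaker retraining is needed for the edit.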

Abstract

Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM-stage speedup, a notable acceleration achieved despite the absence of KV cache, an optimization the AR baseline heavily relies on. Beyond acceleration, the bidirectional architecture…
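
To illustrate what "a fixed number of parallel steps" means in practice, here is a minimal toy sketch of masked-diffusion-style decoding. It is not the authors' implementation: `toy_predictor` is a random stand-in for the bidirectional model, and the even per-step quota with confidence-based commitment is an assumed schedule, since the abstract does not specify one.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a still-masked speech-token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the model: one (token, confidence) guess per position.

    A real model would condition on the text and on all visible tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_diffusion_decode(length, steps, vocab_size=16, seed=0):
    """Fill `length` masked positions in at most `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    passes = 0
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        passes += 1
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed schedule: spread remaining positions over remaining steps.
        quota = math.ceil(len(masked) / (steps - step))
        # Commit the most confident guesses; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:quota]:
            tokens[i] = preds[i][0]
    return tokens, passes


short_tokens, short_passes = masked_diffusion_decode(length=200, steps=64)
long_tokens, long_passes = masked_diffusion_decode(length=2000, steps=64)
```

With this schedule, both the 200-token and the 2000-token sequence finish in 64 passes, whereas AR decoding would need 200 and 2000 sequential steps respectively. Each pass here processes the full sequence, so the pass count alone does not determine wall-clock speedup.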

Code & Models

Repositories

https://deft-piroshki-b652b5.netlify.app
github
