TL;DR
LLaDA-TTS introduces a masked diffusion model for speech synthesis that accelerates inference, enables zero-shot editing, and transfers from autoregressive models with minimal fine-tuning.
Contribution
It presents a masked diffusion approach to TTS that decodes in a fixed number of parallel steps and supports zero-shot editing. The model is initialized from a pretrained AR checkpoint and adapted with only 50 hours of fine-tuning data.
Findings
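Achieves a 2x LLM-stage speedup over the AR baseline at 64 steps, even though the diffusion model runs without the KV cache that the AR baseline relies on.
Matches the CosyVoice 3 baseline on Seed-TTS-Eval (0.98% CER zh, 1.96% WER en) using only 50 hours of fine-tuning data.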
Enables zero-shot speech editing like insertion, deletion, and substitution.
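One way to picture the three editing operations is as transformations of the speech-token sequence that leave masked slots for a bidirectional model to fill in. The sketch below is illustrative only: `MASK`, `edit_template`, and the choice to express deletion as a plain splice are assumptions, not details taken from the paper.

```python
MASK = None  # hypothetical placeholder for a speech-token slot to regenerate


def edit_template(tokens, op, start, end=None, new_len=0):
    """Return a new token list in which the edited region is ready for infilling.

    tokens  : list of existing speech-token ids
    op      : "substitute", "insert", or "delete"
    start   : index where the edit begins
    end     : exclusive end index (used by substitute and delete)
    new_len : number of MASK slots to generate (substitute and insert)
    """
    if op == "substitute":
        # Drop tokens[start:end] and leave new_len slots to be regenerated.
        return tokens[:start] + [MASK] * new_len + tokens[end:]
    if op == "insert":
        # Keep every original token and open new_len slots at `start`.
        return tokens[:start] + [MASK] * new_len + tokens[start:]
    if op == "delete":
        # Splice the span out; nothing is regenerated in this simplified form.
        return tokens[:start] + tokens[end:]
    raise ValueError(f"unknown edit operation: {op}")


original = [10, 11, 12, 13, 14, 15]
substituted = edit_template(original, "substitute", 2, 4, new_len=3)
inserted = edit_template(original, "insert", 2, new_len=2)
deleted = edit_template(original, "delete", 2, 4)
```

In this framing the unmasked tokens on both sides act as context for the masked slots. A practical system might also re-mask a few tokens around a deletion boundary to smooth the join; that is a guess and not something this summary states.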
Abstract
Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM-stage speedup--a notable acceleration achieved despite the absence of KV cache, an optimization the AR baseline heavily relies on. Beyond acceleration, the bidirectional architecture…
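The claim that latency is decoupled from sequence length can be illustrated with a toy decoding loop. This is a minimal sketch, not the paper's algorithm: the random `toy_predictor` stands in for the bidirectional model, and the linear unmasking schedule with confidence-based selection is an assumption borrowed from common masked-decoding practice. All names here are hypothetical.

```python
import random

MASK = -1  # hypothetical sentinel for a still-masked speech-token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the bidirectional model (assumption, not the real network).

    Proposes a (token, confidence) pair for every masked position. A real
    model would condition on the text and on all currently unmasked tokens.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def masked_diffusion_decode(seq_len, num_steps, vocab_size=16, seed=0):
    """Fill seq_len positions using at most num_steps parallel model calls."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    model_calls = 0
    for step in range(num_steps):
        proposals = toy_predictor(tokens, vocab_size, rng)
        if not proposals:
            break  # everything is already unmasked
        model_calls += 1
        # Assumed linear schedule: positions that should be filled after this step.
        target_filled = (seq_len * (step + 1) + num_steps - 1) // num_steps
        n_commit = target_filled - (seq_len - len(proposals))
        # Commit the most confident proposals; the rest stay masked for later steps.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:n_commit]:
            tokens[pos] = tok
    return tokens, model_calls


short_tokens, short_calls = masked_diffusion_decode(seq_len=200, num_steps=64)
long_tokens, long_calls = masked_diffusion_decode(seq_len=1000, num_steps=64)
```

Both calls use the same number of model invocations even though the second sequence is five times longer, whereas AR decoding would need 200 and 1000 sequential steps respectively. Each diffusion step processes the full sequence, so per-step cost still grows with length; only the step count is fixed.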
