DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin

TL;DR
DMOSpeech introduces a distilled diffusion-based TTS model that enables end-to-end optimization with perceptual metrics, achieving faster inference and improved speech quality through direct gradient-based training.
Contribution
According to the authors, DMOSpeech is the first TTS system to successfully perform end-to-end optimization of differentiable perceptual metrics, which it does through a distilled diffusion model.
Findings
Significant improvements in naturalness, intelligibility, and speaker similarity.
Inference time reduced by orders of magnitude.
Validated through extensive human evaluation.
Abstract
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive…
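The abstract names two differentiable metrics, CTC loss and SV loss, as training objectives. The following is a minimal sketch of how such terms could be computed and combined, not the authors' implementation. It assumes the SV loss is a cosine distance between speaker embeddings and that the terms are added with weights; the names `sv_loss`, `combined_loss`, `lambda_ctc`, and `lambda_sv` are hypothetical. It uses plain Python to evaluate loss values only; actual training would need an autodiff framework so gradients can flow back to the generator.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss for one utterance via the standard forward recursion.

    log_probs: list of T frames, each a list of log-probabilities per label.
    target:    list of label ids without blanks.
    """
    # Extended label sequence with blanks interleaved: [b, y1, b, y2, ..., b]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)

    # Initialise at t = 0: a path may start with the blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping the intermediate blank is allowed only between
            # two different non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    total = logsumexp(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[0]
    return -total

def sv_loss(emb_generated, emb_prompt):
    """Assumed SV loss: cosine distance between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_generated, emb_prompt))
    norm_g = math.sqrt(sum(a * a for a in emb_generated))
    norm_p = math.sqrt(sum(b * b for b in emb_prompt))
    return 1.0 - dot / (norm_g * norm_p)

def combined_loss(base_loss, ctc, sv, lambda_ctc=1.0, lambda_sv=1.0):
    """Hypothetical weighted sum of a base training loss and the two metrics."""
    return base_loss + lambda_ctc * ctc + lambda_sv * sv

# Toy usage: 2 frames, vocabulary {0: blank, 1: label}, uniform predictions.
frames = [[math.log(0.5), math.log(0.5)]] * 2
ctc = ctc_neg_log_likelihood(frames, target=[1])
sv = sv_loss([1.0, 0.0], [0.0, 1.0])
total = combined_loss(base_loss=0.0, ctc=ctc, sv=sv)
```

In the toy case three of the four possible two-frame alignments collapse to the target, so the CTC term equals the negative log of 0.75. In this sketch the CTC term stands in for intelligibility and the SV term for speaker similarity.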
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Diffusion · ALIGN
