DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Yinghao Aaron Li; Rithesh Kumar; Zeyu Jin

arXiv:2410.11097 · eess.AS · February 21, 2025

TL;DR

DMOSpeech is a distilled diffusion-based TTS model that runs faster than its teacher model. Because gradients reach every model component, it can be optimized end to end on differentiable perceptual metrics, which improves speech quality.

Contribution

The authors report the first successful end-to-end optimization of differentiable perceptual metrics in TTS, made possible by a distilled diffusion model.

Findings

01. Significant improvements in naturalness, intelligibility, and speaker similarity.

02. Inference time reduced by orders of magnitude.

03. Validated through extensive human evaluation.

Abstract

Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive…
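The abstract names two differentiable metrics, a CTC loss for intelligibility and an SV loss for speaker similarity. As a rough illustration of what those two quantities measure, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the use of cosine distance for the SV term, and the weights `w_ctc` and `w_sv` are all assumptions made for this example, and a real training setup would compute these in an autodiff framework so gradients flow back into the synthesizer.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of symbol ids (no blanks).
    """
    # Extended target with blanks interleaved: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    size = len(ext)

    # Initialisation: a path may start on the first blank or the first symbol.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]                      # stay on the same state
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one state
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last symbol or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


def sv_loss(emb_generated, emb_prompt):
    """Assumed SV loss: 1 - cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_generated, emb_prompt))
    norm_g = math.sqrt(sum(a * a for a in emb_generated))
    norm_p = math.sqrt(sum(b * b for b in emb_prompt))
    return 1.0 - dot / (norm_g * norm_p)


def metric_loss(log_probs, target, emb_generated, emb_prompt, w_ctc=1.0, w_sv=1.0):
    """Hypothetical weighted sum of the two metric losses."""
    return w_ctc * ctc_nll(log_probs, target) + w_sv * sv_loss(emb_generated, emb_prompt)


# Toy example: 2 frames, vocabulary {0: blank, 1: "a"}, uniform posteriors.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
loss = metric_loss(uniform, [1], [1.0, 0.0], [1.0, 0.0])
```

In the toy example, three alignments collapse to the target "a" (a a, a blank, blank a), each with probability 0.25, so the CTC term is -log(0.75); the SV term is zero because the two embeddings are identical.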

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Diffusion · ALIGN