CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model
Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo

TL;DR
CoMoSpeech introduces a one-step diffusion-based speech synthesis method that greatly accelerates inference while maintaining high audio quality, making diffusion models practical for real-time applications.
Contribution
It proposes a novel consistency model distillation approach enabling single-step diffusion sampling for speech synthesis, drastically improving inference speed.
Findings
Achieves inference more than 150 times faster than real time on a single GPU.
Maintains comparable or superior audio quality to multi-step diffusion models.
Outperforms traditional methods like FastSpeech2 in inference speed with high-quality output.
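To make the "faster than real time" figure concrete, here is a minimal sketch of how such a speedup is usually computed. The function names and the timing numbers are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (RTF): compute time spent per second of generated audio."""
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """How many times faster than playback the synthesis runs (the inverse of RTF)."""
    return audio_seconds / synthesis_seconds


# Hypothetical example: 10 s of audio synthesized in 0.05 s of compute.
example_rtf = real_time_factor(0.05, 10.0)
example_speedup = speedup_over_real_time(0.05, 10.0)

# "More than 150 times faster than real time" corresponds to an RTF below 1/150.
rtf_threshold = 1.0 / 150.0
```

Under this definition, any system whose RTF falls below roughly 0.0067 meets the 150x claim.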
Abstract
Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which performs speech synthesis in a single diffusion sampling step while achieving high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single…
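The single-step sampling described in the abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: it assumes the skip/output parameterization commonly used for consistency models (the abstract does not spell it out), the constants SIGMA_DATA, EPS, and T_MAX are assumed defaults, and toy_network is a hypothetical stand-in for a trained student network.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed data standard deviation (not from the source)
EPS = 0.002       # assumed smallest noise level
T_MAX = 80.0      # assumed largest noise level


def c_skip(t: float) -> float:
    # Weight on the noisy input; equals 1 at t = EPS.
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t: float) -> float:
    # Weight on the network output; equals 0 at t = EPS.
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def consistency_fn(network, x, t, cond):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t, cond).

    The weights enforce the boundary condition f(x, EPS) = x, so every point
    on a noise trajectory is mapped toward the same clean sample.
    """
    raw = network(x, t, cond)
    return [c_skip(t) * xi + c_out(t) * ri for xi, ri in zip(x, raw)]


def one_step_sample(network, cond, dim, rng):
    """Draw pure noise at the largest noise level and denoise it in one call."""
    x_T = [rng.gauss(0.0, T_MAX) for _ in range(dim)]
    return consistency_fn(network, x_T, T_MAX, cond)


def toy_network(x, t, cond):
    # Hypothetical placeholder: a real student would predict acoustic features
    # (for example, a mel-spectrogram) from the text or score conditioning.
    return list(cond)


rng = random.Random(0)
sample = one_step_sample(toy_network, cond=[0.1, -0.2, 0.3], dim=3, rng=rng)
```

The point of the sketch is the call count: a multi-step diffusion sampler evaluates the network once per step, whereas one_step_sample evaluates it exactly once, which is where the inference speedup comes from.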
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
