CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Zhen Ye; Wei Xue; Xu Tan; Jie Chen; Qifeng Liu; Yike Guo

arXiv:2305.06908 · cs.SD · October 31, 2023 · 1 citation

TL;DR

CoMoSpeech is a consistency-model-based method that synthesizes speech and singing voice in a single diffusion sampling step, speeding up inference while maintaining high audio quality and making diffusion models practical for real-time applications.

Contribution

It proposes a novel consistency model distillation approach enabling single-step diffusion sampling for speech synthesis, drastically improving inference speed.

Findings

01. Achieves inference more than 150 times faster than real time on a single GPU.

02. Maintains audio quality comparable or superior to multi-step diffusion models.

03. Reaches inference speed comparable to FastSpeech2 while producing high-quality output.
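The "faster than real time" figure is usually reported through the real-time factor (RTF): synthesis time divided by the duration of the generated audio. The sketch below shows that arithmetic. The function names and the timing numbers are hypothetical illustrations, not measurements or code from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """How many times faster than real time the synthesis ran (1 / RTF)."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical example: 10 s of audio generated in 0.0625 s of compute.
rtf = real_time_factor(0.0625, 10.0)
speedup = speedup_over_real_time(0.0625, 10.0)
print(f"RTF = {rtf:.5f}, speedup = {speedup:.0f}x")
```

Under these assumed numbers the system would run 160 times faster than real time, which is the kind of measurement behind a claim such as "more than 150 times faster".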

Abstract

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single…
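To make the idea of single-step sampling concrete, here is a minimal sketch of how a consistency function can map pure noise to a sample in one evaluation. It assumes the skip/output parameterization commonly used for consistency models, f(x, t) = c_skip(t) x + c_out(t) F(x, t), with c_skip(eps) = 1 and c_out(eps) = 0 so that the function is the identity at the smallest noise level. The constants, the toy network, and all function names are hypothetical; the abstract does not specify CoMoSpeech's actual architecture or hyperparameters.

```python
import math
import random

# Assumed, illustrative constants (not taken from the paper).
SIGMA_DATA = 0.5   # assumed scale of the clean data
EPS = 0.002        # assumed smallest noise level
T_MAX = 80.0       # assumed largest noise level


def c_skip(t: float) -> float:
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t: float) -> float:
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x, t, cond):
    """Hypothetical stand-in for a trained network: returns the conditioning."""
    return list(cond)


def consistency_fn(x, t, cond, network=toy_network):
    """f(x, t) = c_skip(t) * x + c_out(t) * network(x, t, cond), elementwise."""
    out = network(x, t, cond)
    return [c_skip(t) * xi + c_out(t) * oi for xi, oi in zip(x, out)]


def one_step_sample(cond, seed=0):
    """Draw noise at the largest level and map it to a sample in one call."""
    rng = random.Random(seed)
    x_T = [rng.gauss(0.0, T_MAX) for _ in cond]
    return consistency_fn(x_T, T_MAX, cond)


# Hypothetical conditioning vector standing in for text-derived features.
sample = one_step_sample([0.1, -0.3, 0.7])
print(len(sample))
```

A multi-step diffusion sampler would call the network once per step; in this sketch the network is evaluated exactly once, which is where the inference speedup comes from. In a distillation setup, a student of this form would be trained so that its outputs agree along the teacher's sampling trajectory.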

Code & Models

Repositories

zhenye234/CoMoSpeech (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings