FlashSpeech: Efficient Zero-Shot Speech Synthesis

Zhen Ye; Zeqian Ju; Haohe Liu; Xu Tan; Jianyi Chen; Yiwen Lu; Peiwen Sun; Jiahao Pan; Weizhen Bian; Shulin He; Wei Xue; Qifeng Liu; Yike Guo

arXiv:2404.14700 · eess.AS · October 25, 2024 · 1 citation

TL;DR

FlashSpeech is a fast, efficient zero-shot speech synthesis system that achieves high-quality, natural-sounding speech with significantly reduced inference time, enabling versatile applications like voice conversion and speech editing.

Contribution

The paper introduces FlashSpeech, a novel zero-shot speech synthesis model that is approximately 20 times faster than previous systems while maintaining high quality and diversity, using a new adversarial consistency training approach.
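
The "approximately 20 times faster" figure and the "approximately 5% of the inference time" figure quoted elsewhere on this page describe the same ratio. As a quick check, let $T_{\text{prev}}$ and $T_{\text{flash}}$ denote the inference times of a previous system and of FlashSpeech (these symbols are introduced here for illustration and are not taken from the paper):

```latex
\[
T_{\text{flash}} \approx 0.05\, T_{\text{prev}}
\quad\Longrightarrow\quad
\frac{T_{\text{prev}}}{T_{\text{flash}}} \approx \frac{1}{0.05} = 20 .
\]
```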

Findings

01. FlashSpeech reduces inference time to about 5% of that of previous systems.

02. It maintains high audio quality and speaker similarity in zero-shot synthesis.

03. The system is versatile, supporting voice conversion, speech editing, and diverse sampling.

Abstract

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can…

Code & Models

Repositories

zhenye234/CoMoSpeech (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Diffusion