FlashSpeech: Efficient Zero-Shot Speech Synthesis
Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Wei Xue, Qifeng Liu, Yike Guo

TL;DR
FlashSpeech is a fast, efficient zero-shot speech synthesis system that achieves high-quality, natural-sounding speech with significantly reduced inference time, enabling versatile applications like voice conversion and speech editing.
Contribution
The paper introduces FlashSpeech, a novel zero-shot speech synthesis model that is approximately 20 times faster than previous systems while maintaining high quality and diversity, using a new adversarial consistency training approach.
Findings
FlashSpeech reduces inference time to about 5% of previous systems.
It maintains high audio quality and speaker similarity in zero-shot synthesis.
The system is versatile, supporting voice conversion, speech editing, and diverse sampling.
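The "about 5% of the inference time" and "approximately 20 times faster" figures above are two views of the same ratio. A minimal sketch of that arithmetic follows; the helper name `speedup_from_time_fraction` is hypothetical and used only for illustration, and 0.05 is the rounded figure quoted on this page, not a measured value.

```python
def speedup_from_time_fraction(fraction: float) -> float:
    """Return the speedup factor implied by a relative inference time.

    `fraction` is the new system's inference time divided by the
    baseline's inference time (for example, 0.05 for "about 5%").
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


# Rounded figure quoted in the findings above (an approximation, not a benchmark).
reported_fraction = 0.05
print(f"Implied speedup: about {speedup_from_time_fraction(reported_fraction):.0f}x")
```

Because both numbers are approximate, the exact factor depends on which baseline system and hardware the comparison uses.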
Abstract
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Diffusion
