Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen

TL;DR
This paper introduces Fast F5-TTS, a training-free method that employs Empirically Pruned Step Sampling to significantly accelerate flow-matching-based TTS models, reducing sampling steps and inference time while preserving quality.
Contribution
The paper proposes a novel non-uniform sampling strategy, EPSS, that reduces sampling steps in flow-matching TTS models without retraining, improving inference speed.
Findings
Achieves 4x faster inference with 7-step generation on F5-TTS.
Maintains comparable speech quality despite reduced sampling steps.
Demonstrates strong generalization on E2 TTS models.
Abstract
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
MethodsSoftmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
