Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

Qixi Zheng; Yushen Chen; Zhikang Niu; Ziyang Ma; Xiaofei Wang; Kai Yu; Xie Chen

arXiv:2505.19931 · eess.AS · June 5, 2025

TL;DR

This paper introduces Fast F5-TTS, a training-free method that uses Empirically Pruned Step Sampling (EPSS) to accelerate flow-matching-based TTS models. It reduces F5-TTS generation to 7 sampling steps and makes inference about 4 times faster while maintaining comparable quality.

Contribution

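The paper proposes Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy derived from inspecting the sampling trajectory of F5-TTS. It reduces the number of sampling steps in flow-matching TTS models without retraining, which improves inference speed.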

Findings

1. Achieves 4x faster inference with 7-step generation on F5-TTS.
2. Maintains comparable speech quality despite reduced sampling steps.
3. Demonstrates strong generalization on E2 TTS models.

Abstract

Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore,…
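To make the idea of non-uniform time-step sampling concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a first-order Euler solver, and `toy_velocity` is a hypothetical stand-in for the trained flow-matching model. The grid `pruned_7` contains illustrative values only; it is not the schedule reported in the paper, and the choice to keep more steps near t = 0 is an assumption made for the example.

```python
import numpy as np


def euler_sample(velocity_fn, x0, timesteps):
    """Integrate dx/dt = velocity_fn(x, t) with Euler steps over an arbitrary time grid."""
    x = np.asarray(x0, dtype=float)
    # Each pair of neighbouring grid points costs one model evaluation.
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x


def toy_velocity(x, t):
    # Hypothetical velocity field standing in for the TTS model.
    return -x


# Baseline: 32 equally spaced steps from t = 0 to t = 1.
uniform_32 = np.linspace(0.0, 1.0, 33)

# Hypothetical pruned grid with 7 steps (illustrative values, not from the paper):
# a subset of the uniform grid points, with larger jumps later in the trajectory.
pruned_7 = np.array([0, 2, 4, 6, 8, 12, 16, 32]) / 32.0

x0 = np.ones(4)  # stands in for the initial noise sample
x_uniform = euler_sample(toy_velocity, x0, uniform_32)
x_pruned = euler_sample(toy_velocity, x0, pruned_7)

print("model evaluations:", len(uniform_32) - 1, "vs", len(pruned_7) - 1)
```

The solver itself is unchanged between the two runs; only the time grid differs, which is why a pruned schedule of this kind needs no retraining. How much quality is retained depends on which steps are dropped, and that selection is what the paper determines empirically.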

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings