RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

Hyun Joon Park; Jeongmin Liu; Jin Sob Kim; Jeong Yeol Yang; Sung Won Han; Eunwoo Song

arXiv:2506.16741 · eess.AS · June 23, 2025

TL;DR

RapFlow-TTS is a novel text-to-speech model that significantly reduces synthesis steps while maintaining high quality by enforcing velocity consistency in flow matching training.

Contribution

It introduces velocity consistency constraints in flow matching training and techniques like time interval scheduling and adversarial learning for efficient high-fidelity TTS.

Findings

01. Achieves 5- and 10-fold reductions in synthesis steps

02. Maintains high speech quality with fewer steps

03. Outperforms conventional flow matching approaches

Abstract

We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold reductions in synthesis steps compared with the conventional FM- and score-based…
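The link between a consistent velocity field and few-step generation can be illustrated with a toy example. The sketch below is not the RapFlow-TTS model or its training code; it is a minimal pure-Python illustration, assuming a fixed-step Euler solver and two hand-written 2-D velocity fields. All names (`euler_sample`, `straight_velocity`, `curved_velocity`, `TARGET`) are hypothetical. When the velocity is the same at every point of the trajectory, one Euler step lands where 100 steps do; when the velocity changes along the path, the one-step result drifts far from the many-step result.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical toy "data point" that both velocity fields transport samples toward.
TARGET: Vector = [2.0, 1.0]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight path x_t = (1 - t) * x0 + t * TARGET.

    Along that path (TARGET - x_t) / (1 - t) equals TARGET - x0, so the
    velocity is constant (consistent) over the whole trajectory.
    """
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(TARGET, x)]


def curved_velocity(x: Vector, t: float) -> Vector:
    """A field whose velocity shrinks as x approaches TARGET (not constant)."""
    return [3.0 * (gi - xi) for gi, xi in zip(TARGET, x)]


def max_gap(a: Vector, b: Vector) -> float:
    """Largest per-coordinate absolute difference between two points."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


x0 = [0.5, -1.0]  # toy starting sample (stands in for the noise sample)

# Gap between 1-step and 100-step generation for each field.
gap_straight = max_gap(euler_sample(straight_velocity, x0, 1),
                       euler_sample(straight_velocity, x0, 100))
gap_curved = max_gap(euler_sample(curved_velocity, x0, 1),
                     euler_sample(curved_velocity, x0, 100))

print(f"straight field, 1 vs 100 steps: gap = {gap_straight:.2e}")
print(f"curved field,   1 vs 100 steps: gap = {gap_curved:.2e}")
```

In this toy setting the straight, constant-velocity field makes the step count nearly irrelevant, which is the intuition behind enforcing velocity consistency along a straightened trajectory. A learned velocity field is only approximately consistent, which is presumably why the abstract pairs the constraint with additional techniques for few-step quality.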

Code & Models

Repositories

naver-ai/RapFlow-TTS
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing