VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Yiwei Guo; Chenpeng Du; Ziyang Ma; Xie Chen; Kai Yu

arXiv:2309.05027·eess.AS·September 4, 2024

TL;DR

VoiceFlow is a text-to-speech acoustic model that uses rectified flow matching to reach high synthesis quality in few sampling steps, addressing the slow sampling of diffusion-based TTS.

Contribution

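The paper proposes VoiceFlow, an acoustic model that casts mel-spectrogram generation as an ordinary differential equation conditioned on text, estimates its vector field, and applies the rectified flow technique to straighten the sampling trajectory so that fewer sampling steps are needed.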

Findings

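01. VoiceFlow achieves higher synthesis quality than its diffusion counterpart in subjective and objective evaluations on both single- and multi-speaker corpora.

02. The rectified flow technique straightens the sampling trajectory, which makes synthesis efficient.

03. Ablation studies verify the contribution of the rectified flow technique.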
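
To make the second finding concrete, the toy sketch below shows why a straight sampling trajectory needs fewer ODE solver steps. It is not taken from the paper or its repository: the 1-D fields `straight_field` and `curved_field`, the helper `euler_sample`, and all constants are hypothetical, chosen so that both ODEs carry the start point 0.5 to the same end point 2.0.

```python
import math


def euler_sample(x0, vector_field, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x


def straight_field(x, t):
    # Hypothetical field induced by the pairing x1 = 2 * x0 + 1 with
    # linear interpolation x_t = (1 - t) * x0 + t * x1. The velocity is
    # constant along each trajectory, so every path is a straight line.
    return (x - t) / (1.0 + t) + 1.0


def curved_field(x, t):
    # Hypothetical field dx/dt = ln(4) * x. The exact solution is
    # x(1) = 4 * x0, but the path is exponential, not straight.
    return math.log(4.0) * x


if __name__ == "__main__":
    exact = 2.0  # both ODEs map x0 = 0.5 to 2.0 at t = 1
    for name, field in [("straight", straight_field), ("curved", curved_field)]:
        for steps in (1, 10, 100):
            err = abs(euler_sample(0.5, field, steps) - exact)
            print(f"{name:8s} steps={steps:3d} abs_error={err:.4f}")
```

On the straight field a single Euler step already lands on the exact end point, while the curved field needs many steps to get close. This is the intuition behind straightening trajectories; a real acoustic model works in a high-dimensional mel-spectrogram space with a learned, text-conditioned field.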

Abstract

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique…
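
As a rough illustration of what "estimating the vector field" can look like, the sketch below builds one flow matching training example using the linear interpolation common in rectified flow. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `flow_matching_pair` and `flow_matching_loss` are hypothetical, text conditioning is omitted, and the paper's exact probability path and noise level may differ.

```python
def flow_matching_pair(x0, x1, t):
    """Build one training example for a vector field estimator.

    x0: noise sample, x1: data sample (e.g. a flattened mel-spectrogram),
    t: time in [0, 1]. Assumes the linear path x_t = (1 - t) * x0 + t * x1,
    whose velocity with respect to t is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def flow_matching_loss(predict, x0, x1, t):
    """Mean squared error between the predicted and target velocity.

    `predict(x_t, t)` stands in for a learned model; a real acoustic model
    would also take the text condition as input (omitted here).
    """
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)
```

In the rectified flow procedure, a model trained this way is used to generate (noise, output) pairs by solving its ODE, and the same regression is then repeated on those pairs, which is what straightens the trajectories.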

Code & Models

Repositories

X-LANCE/VoiceFlow-TTS (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion