VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

TL;DR
VoiceFlow is an acoustic model for text-to-speech that uses rectified flow matching to reach high synthesis quality with only a limited number of sampling steps, addressing the sampling inefficiency of diffusion models.
Contribution
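The paper applies a rectified flow matching algorithm to TTS: mel-spectrogram generation is cast as a text-conditioned ordinary differential equation whose vector field is learned, and the rectified flow technique straightens the sampling trajectory so that fewer steps are needed.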
Findings
In subjective and objective evaluations on single- and multi-speaker corpora, VoiceFlow achieves better synthesis quality than its diffusion counterpart.
The rectified flow technique effectively straightens sampling trajectories.
Ablation studies confirm the effectiveness of the rectified flow method.
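To illustrate why straighter trajectories help, here is a minimal sketch (not the paper's implementation) of a fixed-step Euler ODE sampler applied to two toy one-dimensional vector fields. The names `euler_sample`, `straight_field`, and `curved_field` are hypothetical and chosen only for this example. With a constant velocity, which corresponds to a perfectly straight trajectory, a single Euler step is exact; with a curved trajectory, few steps leave a visible error.

```python
import math

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy field with a straight trajectory: constant velocity (assumed for illustration).
def straight_field(x, t):
    return 3.0

# Toy field with a curved trajectory: dx/dt = x, whose exact solution is x0 * e^t.
def curved_field(x, t):
    return x

# Straight path: one step already lands on the exact endpoint x0 + 3.
straight_one_step = euler_sample(straight_field, 1.0, 1)

# Curved path: the error shrinks only as the number of steps grows.
curved_error_1 = abs(euler_sample(curved_field, 1.0, 1) - math.e)
curved_error_100 = abs(euler_sample(curved_field, 1.0, 100) - math.e)
```

In this toy setting the step count needed for a given accuracy depends on how curved the trajectory is, which is the intuition behind straightening the sampling path.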
Abstract
Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique…
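As a rough illustration of how a vector field can be estimated by regression, the sketch below builds a generic flow matching training pair using linear interpolation between a noise sample and a data sample. This is an assumption made for the example: the abstract does not give the exact probability path or loss used in VoiceFlow, and text conditioning is omitted. The function names `flow_matching_pair` and `flow_matching_loss` are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the velocity target x1 - x0.

    x0: noise sample, x1: data sample (e.g. one mel-spectrogram frame),
    both given as lists of floats; t is a time in [0, 1].
    Linear interpolation is an assumption of this sketch.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and target velocity at x_t."""
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

# Usage with stand-in predictors (a real model would be a neural network).
x0 = [0.0, 0.0]
x1 = [2.0, 4.0]
x_t, target = flow_matching_pair(x0, x1, 0.5)

oracle = lambda x, t: [2.0, 4.0]      # predicts the true velocity
zero_model = lambda x, t: [0.0, 0.0]  # predicts no motion at all
loss_oracle = flow_matching_loss(oracle, x0, x1, 0.5)
loss_zero = flow_matching_loss(zero_model, x0, x1, 0.5)
```

A trained estimator of this kind would then be plugged into an ODE solver to generate samples from noise.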
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Diffusion
