SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Xuyuan Li, Zengqiang Shang, Hua Hua, Peiyang Shi, Chen Yang, Li Wang, Pengyuan Zhang

TL;DR
SF-Speech introduces a novel ODE-based voice cloning model that improves trajectory straightness and efficiency, outperforming state-of-the-art methods in zero-shot TTS with fewer solver steps and faster generation.
Contribution
The paper proposes a lightweight multi-stage module to generate deterministic initial distributions and straighten ODE trajectories without extra loss functions.
Findings
Outperforms state-of-the-art zero-shot TTS methods.
Requires only a quarter of the solver steps used by previous models.
Achieves approximately 3.7 times faster generation speed.
Abstract
Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice clone task. Nevertheless, assuming standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections among the fitted targets of flow matching, which makes model training harder and increases the curvature of the learned generation trajectories. These curved trajectories limit the ability of ODE models to generate desirable samples in only a few steps. This paper proposes SF-Speech, a novel voice clone model based on ODE and in-context learning. Unlike previous work, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the…
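The link the abstract draws between trajectory curvature and solver step count can be illustrated with a toy example. The sketch below is not SF-Speech and uses no learned model: it integrates two hand-written 2-D velocity fields with a fixed-step Euler solver, one that moves along a straight line and one that moves along a quarter-circle arc to the same endpoint. All names (`euler_solve`, `straight_velocity`, `curved_velocity`, `endpoint_error`) and both fields are assumptions chosen for illustration only.

```python
import numpy as np

# Toy start and end points (illustrative assumptions, not data from the paper).
x_start = np.array([1.0, 0.0])
x_target = np.array([0.0, 1.0])


def euler_solve(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def straight_velocity(x, t):
    # Constant velocity: the exact path is the straight segment
    # from x_start to x_target.
    return x_target - x_start


def curved_velocity(x, t):
    # Rotation at pi/2 radians per unit time: the exact path is a
    # quarter-circle arc from x_start to x_target.
    omega = np.pi / 2
    return omega * np.array([-x[1], x[0]])


def endpoint_error(velocity, num_steps):
    """Distance between the Euler endpoint and the true endpoint x_target."""
    x_end = euler_solve(velocity, x_start, num_steps)
    return float(np.linalg.norm(x_end - x_target))


if __name__ == "__main__":
    for steps in (1, 4, 32):
        print(
            steps,
            endpoint_error(straight_velocity, steps),
            endpoint_error(curved_velocity, steps),
        )
```

In this toy setting the straight field is integrated exactly by a single Euler step, while the curved field needs many steps before its endpoint error becomes small. That is the general motivation for straightening trajectories; how SF-Speech achieves it is described in the paper itself, not in this sketch.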
Taxonomy
Topics: Speech Recognition and Synthesis
