CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation

Ziqi Liang; Xulong Zhang; Chang Liu; Xiaoyang Qu; Weifeng Zhao; Jianzong Wang

arXiv:2501.01861·cs.SD·January 6, 2025

Open Access

TL;DR

CycleFlow introduces a cycle consistency-based flow matching method for non-parallel voice conversion, significantly enhancing speaker similarity and speech quality by addressing pitch and style adaptation challenges.

Contribution

The paper proposes CycleFlow, a novel cycle consistency approach in flow matching for improved non-parallel voice conversion and pitch adaptation.

Findings

01 Enhanced speaker similarity and naturalness in converted speech

02 Significant improvement in pitch and style adaptation quality

03 Outperforms existing VC methods in experiments

Abstract

Voice Conversion (VC) aims to convert the style of a source speaker, such as timbre and pitch, to the style of any target speaker while preserving the linguistic content. However, no ground truth for the converted speech exists in a non-parallel VC scenario, which induces a train-inference mismatch problem. Moreover, existing methods still suffer from inaccurate pitch and low speaker adaptation quality, and there is a significant disparity in pitch between the source and target speaker style domains. As a result, the models tend to generate hoarse-sounding speech, which makes high-quality voice conversion difficult to achieve. In this study, we propose CycleFlow, a novel VC approach that leverages cycle consistency in conditional flow matching (CFM) for speaker timbre adaptation training on non-parallel data. Furthermore, we design a Dual-CFM based on VoiceCFM and PitchCFM to generate…
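The abstract names two ingredients, conditional flow matching and cycle consistency, without showing how either is computed. The following is a minimal NumPy sketch of both ideas, not the authors' implementation. It assumes the common straight-path CFM formulation with a small `sigma_min` and an L1 cycle loss. All function names (`cfm_pair`, `cfm_loss`, `cycle_consistency_loss`) and the `convert_fn(x, speaker)` interface are hypothetical and do not come from the paper.

```python
import numpy as np


def cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Build a CFM training pair on a straight path from noise x0 to data x1.

    x0, x1: arrays of shape (batch, dim); t: array of shape (batch,) in [0, 1].
    Returns the interpolated point x_t and the target velocity u_t.
    (Assumed formulation; the paper may use a different path.)
    """
    t = t.reshape(-1, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def cfm_loss(velocity_fn, x0, x1, t, cond, sigma_min=1e-4):
    """Mean squared error between the predicted and target velocity."""
    x_t, u_t = cfm_pair(x0, x1, t, sigma_min)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - u_t) ** 2))


def cycle_consistency_loss(convert_fn, x_src, spk_src, spk_tgt):
    """L1 distance between the source features and their round trip.

    convert_fn(x, speaker) is a hypothetical converter. Non-parallel data has
    no ground truth for x_fake, so the loss compares the source with
    source -> target -> source instead.
    """
    x_fake = convert_fn(x_src, spk_tgt)     # source converted to the target style
    x_cycle = convert_fn(x_fake, spk_src)   # converted back to the source style
    return float(np.mean(np.abs(x_cycle - x_src)))
```

In a training loop, the cycle term would typically be added to the CFM loss with a weighting coefficient. How CycleFlow actually combines them is not stated in the text above.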

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems