Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN
Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li

TL;DR
This paper introduces a cross-lingual voice conversion method that uses CycleGAN to convert both spectral and prosodic features, modeling F0 with a continuous wavelet transform. It requires no parallel data and reports improved naturalness and speaker similarity.
Contribution
It presents the first approach to incorporate prosody, using wavelet-based F0 modeling, into cross-lingual voice conversion with CycleGAN, eliminating the need for parallel data.
Findings
Outperforms baseline in subjective evaluation
First to incorporate prosody in cross-lingual conversion
Uses wavelet transform for hierarchical F0 modeling
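The hierarchical F0 modeling named in the findings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Mexican hat mother wavelet, the ten scales spaced one octave apart, the interpolation over unvoiced frames, and the helper names `continuous_log_f0` and `cwt_decompose` are all choices made here for illustration.

```python
import numpy as np

def continuous_log_f0(f0):
    """Interpolate log-F0 over unvoiced frames (f0 == 0) and normalize it.

    Assumption: unvoiced frames are marked with 0 in the input contour.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("contour has no voiced frames")
    idx = np.arange(len(f0))
    # Linear interpolation fills the gaps so the wavelet sees a continuous signal.
    log_f0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_decompose(signal, num_scales=10, base_scale=1.0):
    """Decompose a 1-D contour into num_scales temporal-scale components.

    Row i of the result uses scale base_scale * 2**(i + 1) frames
    (hypothetical dyadic spacing), so higher rows track slower variation.
    """
    x = np.asarray(signal, dtype=float)
    n = len(x)
    components = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        half = int(min(5 * scale, n))  # truncate the wavelet support
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        # The wavelet is symmetric, so convolution equals correlation here.
        full = np.convolve(x, kernel, mode="full")
        components[i] = full[half:half + n]  # keep the centered n samples
    return components
```

In this sketch, low-index rows respond to fast, syllable-like F0 movements, while high-index rows follow slower phrase-level and utterance-level trends.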
Abstract
Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion, with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, so a linear method alone is insufficient for conversion. We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody at different time resolutions. We also propose to train two CycleGAN pipelines for spectrum and prosody mapping, respectively. In this way, we…
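For contrast, the linear F0 transfer that the abstract attributes to previous studies is often implemented as a log-Gaussian normalized transformation. The sketch below assumes that form; the abstract does not say which linear transformation earlier work used, and the function name and the convention that unvoiced frames are marked with 0 are hypothetical.

```python
import numpy as np

def log_gaussian_f0_transform(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Map source F0 to the target speaker's log-F0 mean and standard deviation.

    mu_* and sigma_* are the mean and standard deviation of log-F0 over
    each speaker's voiced frames. Unvoiced frames (0) are left at 0.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    converted = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    # A single global shift and scale: every time scale is treated identically.
    converted[voiced] = np.exp((log_f0 - mu_src) * (sigma_tgt / sigma_src) + mu_tgt)
    return converted
```

Because the mapping applies one shift and one scale to the whole contour, it cannot treat short-term and long-term prosodic variation differently, which is the limitation the abstract raises.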
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Batch Normalization · Tanh Activation · Cycle Consistency Loss · Instance Normalization · Residual Connection · Residual Block · Convolution · Sigmoid Activation
