MulliVC: Multi-lingual Voice Conversion With Cycle Consistency
Jiawei Huang, Chen Zhang, Yi Ren, Ziyue Jiang, Zhenhui Ye, Jinglin Liu, Jinzheng He, Xiang Yin, Zhou Zhao

TL;DR
MulliVC is a novel multi-lingual voice conversion system that effectively disentangles speaker timbre from content and prosody without requiring multi-lingual paired data, using a cycle consistency approach.
Contribution
It introduces a three-step training process with cycle consistency to enable multi-lingual voice conversion without paired multi-lingual datasets.
Findings
Outperforms existing methods in monolingual and cross-lingual voice conversion
Effectively disentangles timbre from content and prosody
No need for multi-lingual paired data
Abstract
Voice conversion aims to modify the source speaker's voice to resemble the target speaker while preserving the original speech content. Despite notable advancements in voice conversion these days, multi-lingual voice conversion (including both monolingual and cross-lingual scenarios) has yet to be extensively studied. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the rarity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that only converts timbre and keeps original content and source language prosody without multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps: In step one the model is trained with monolingual speech data; then, steps two and three take inspiration from back translation, construct…
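The cycle-consistency idea mentioned in the abstract can be illustrated with a toy sketch. This is not the MulliVC architecture or its training procedure: the `Utterance` class, the `convert` function, and the `distortion` parameter are all hypothetical stand-ins, and the feature vectors are made-up numbers. The sketch only shows the general principle that converting an utterance to a target timbre and then back to the source timbre should reproduce the original, so the round-trip error can serve as a training signal when no paired data from the same speaker exists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    # Hypothetical stand-ins: real systems use learned features, not short lists.
    content: List[float]  # linguistic content and prosody features
    timbre: List[float]   # speaker timbre embedding


def convert(source: Utterance, target_timbre: List[float],
            distortion: float = 0.0) -> Utterance:
    """Toy converter: swap in the target timbre and keep the content.

    `distortion` models an imperfect converter that shifts the content
    features by a constant offset on every pass.
    """
    new_content = [c + distortion for c in source.content]
    return Utterance(content=new_content, timbre=list(target_timbre))


def l1(a: List[float], b: List[float]) -> float:
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(src: Utterance, target_timbre: List[float],
                           distortion: float = 0.0) -> float:
    """Convert to the target timbre, convert back, compare with the source."""
    converted = convert(src, target_timbre, distortion)
    cycled = convert(converted, src.timbre, distortion)
    return l1(cycled.content, src.content) + l1(cycled.timbre, src.timbre)


# Assumed example values, chosen only for illustration.
speaker_a = Utterance(content=[0.2, 0.5, 0.9], timbre=[1.0, 0.0])
timbre_b = [0.0, 1.0]

perfect_loss = cycle_consistency_loss(speaker_a, timbre_b)
imperfect_loss = cycle_consistency_loss(speaker_a, timbre_b, distortion=0.1)
```

In this toy setting a converter that preserves content exactly gives a round-trip loss of zero, while one that distorts content on each pass gives a positive loss. Note that an identity mapping would also achieve zero round-trip loss, so a cycle term on its own does not guarantee that timbre is actually converted.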
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
