MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Jiawei Huang; Chen Zhang; Yi Ren; Ziyue Jiang; Zhenhui Ye; Jinglin Liu; Jinzheng He; Xiang Yin; Zhou Zhao

arXiv:2408.04708 · cs.SD · August 12, 2024

TL;DR

MulliVC is a novel multi-lingual voice conversion system that effectively disentangles speaker timbre from content and prosody without requiring multi-lingual paired data, using a cycle consistency approach.

Contribution

MulliVC introduces a training procedure in which each training step has three substeps, using cycle consistency to enable multi-lingual voice conversion without paired multi-lingual datasets.

Findings

01 Outperforms existing methods in monolingual and cross-lingual voice conversion

02 Effectively disentangles timbre from content and prosody

03 No need for multi-lingual paired data

Abstract

Voice conversion aims to modify the source speaker's voice to resemble the target speaker while preserving the original speech content. Despite notable advancements in voice conversion these days, multi-lingual voice conversion (including both monolingual and cross-lingual scenarios) has yet to be extensively studied. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the rarity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that only converts timbre and keeps original content and source language prosody without multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps: In step one the model is trained with monolingual speech data; then, steps two and three take inspiration from back translation, construct…
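The cyclical idea mentioned in the abstract can be illustrated with a toy round trip: convert speaker A's features to speaker B's timbre, convert back, and measure how far the result is from the original. The sketch below is not MulliVC's model or loss. It assumes, purely for illustration, that timbre is an additive offset on a feature vector, and every name in it (`convert`, `lossy_convert`, `cycle_consistency_loss`) is hypothetical.

```python
def convert(frames, source_timbre, target_timbre):
    """Toy converter (assumption): timbre is an additive offset that can be swapped."""
    return [[x - s + t for x, s, t in zip(frame, source_timbre, target_timbre)]
            for frame in frames]


def lossy_convert(frames, source_timbre, target_timbre):
    """Toy converter that also damages the non-timbre part by halving it."""
    return [[0.5 * (x - s) + t for x, s, t in zip(frame, source_timbre, target_timbre)]
            for frame in frames]


def l1_distance(a, b):
    """Mean absolute difference between two equally shaped frame lists."""
    total = sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    count = sum(len(fa) for fa in a)
    return total / count


def cycle_consistency_loss(frames, timbre_a, timbre_b, converter=convert):
    """L1 distance between the input and its A -> B -> A round trip."""
    converted = converter(frames, timbre_a, timbre_b)        # A's speech in B's timbre
    reconstructed = converter(converted, timbre_b, timbre_a)  # back to A's timbre
    return l1_distance(reconstructed, frames)


# Hypothetical 2-frame, 2-dimensional features for speaker A.
content = [[1.0, 2.0], [3.0, 4.0]]
timbre_a = [0.5, -0.5]
timbre_b = [2.0, 1.0]
speech_a = [[c + t for c, t in zip(frame, timbre_a)] for frame in content]

perfect_loss = cycle_consistency_loss(speech_a, timbre_a, timbre_b)
lossy_loss = cycle_consistency_loss(speech_a, timbre_a, timbre_b, converter=lossy_convert)
print(perfect_loss, lossy_loss)
```

A converter that changes only timbre survives the round trip unchanged, so its loss is zero. One that also alters content is penalized, which is why a cycle objective can encourage a model to separate timbre from the rest without paired recordings of the same speaker in two languages.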

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques