RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging
Mingyang Zhang, Yi Zhou, Yi Ren, Chen Zhang, Xiang Yin, Haizhou Li

TL;DR
RefXVC is a cross-lingual voice conversion method that conditions on both global and local speaker embeddings, drawn from multiple reference utterances, to capture how timbre varies with pronunciation. It reports better speech quality and speaker similarity than existing systems.
Contribution
The paper combines global and local speaker embeddings with multiple reference utterances to improve cross-lingual voice conversion.
Findings
Outperforms existing systems in speech quality
Achieves higher speaker similarity
Effectively captures timbre and pronunciation variations
Abstract
This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and local speaker embeddings to capture the timbre changes during speech conversion. Additionally, we observed a connection between timbre and pronunciation in different languages and utilized this by incorporating a timbre encoder and a pronunciation matching network into our model. Furthermore, we found that the variation in tones is not adequately reflected in a sentence, and therefore, we used multiple references to better capture the range of a speaker's voice. The proposed method outperformed…
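The idea of conditioning on a global embedding plus per-frame local embeddings drawn from several references can be illustrated with a toy sketch. This is not the authors' architecture: the function names, the use of scaled dot-product attention as a stand-in for the pronunciation matching network, the mean-pooled global embedding, and the final concatenation are all assumptions made for illustration only.

```python
import math

def mean_vector(frames):
    """Average equal-length frame vectors into a single vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_embeddings(content_frames, ref_keys, ref_timbre):
    """For each source content frame, attend over all reference frames.

    Assumption: scaled dot-product attention stands in for the
    pronunciation matching network described in the abstract.
    """
    scale = math.sqrt(len(ref_keys[0]))
    dim = len(ref_timbre[0])
    out = []
    for query in content_frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in ref_keys]
        weights = softmax(scores)
        out.append([sum(w * t[d] for w, t in zip(weights, ref_timbre))
                    for d in range(dim)])
    return out

def speaker_conditioning(content_frames, references):
    """Build per-frame conditioning from multiple reference utterances.

    `references` is a list of (pronunciation_keys, timbre_frames) pairs,
    one pair per reference utterance (hypothetical data layout).
    """
    # Pool frames from every reference so matching sees the full range.
    ref_keys = [k for keys, _ in references for k in keys]
    ref_timbre = [t for _, timbre in references for t in timbre]
    global_emb = mean_vector(ref_timbre)  # utterance-independent identity
    local = local_embeddings(content_frames, ref_keys, ref_timbre)
    # Assumption: concatenate global and local parts for each frame.
    return [global_emb + frame for frame in local]
```

Called with two source frames and two single-frame references, each output frame carries the same global part, while the local part follows whichever reference frame best matches the source frame's pronunciation features.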
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
