RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging

Mingyang Zhang; Yi Zhou; Yi Ren; Chen Zhang; Xiang Yin; Haizhou Li

arXiv:2406.16326·eess.AS·June 25, 2024·IEEE ACM Trans. Audio Speech Lang. Process.·1 cites

Open Access

TL;DR

RefXVC introduces a cross-lingual voice conversion method that uses both global and local speaker embeddings, along with multiple references, to better capture timbre and pronunciation variations, significantly improving speech quality and speaker similarity.

Contribution

The paper presents a novel approach that combines global and local speaker embeddings with multiple references to improve cross-lingual voice conversion performance.

Findings

01. Outperforms existing systems in speech quality

02. Achieves higher speaker similarity

03. Effectively captures timbre and pronunciation variations

Abstract

This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and local speaker embeddings to capture the timbre changes during speech conversion. Additionally, we observed a connection between timbre and pronunciation in different languages and utilized this by incorporating a timbre encoder and a pronunciation matching network into our model. Furthermore, we found that the variation in tones is not adequately reflected in a sentence, and therefore, we used multiple references to better capture the range of a speaker's voice. The proposed method outperformed…
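The idea of conditioning on both a global and a local speaker embedding drawn from several references can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the architecture, so everything here is an assumption made for illustration. In particular, the function names (`mean_pool`, `local_embeddings`, `speaker_condition`) are hypothetical, the pronunciation matching network is approximated by plain scaled dot-product attention, the global embedding is taken to be a mean over reference timbre frames, and the two embeddings are combined by concatenation.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length vectors into a single vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_embeddings(src_pron, ref_pron, ref_timbre):
    """For each source frame, weight reference timbre frames by how well
    their pronunciation features match the source frame (assumed here to
    be scaled dot-product attention)."""
    scale = math.sqrt(len(src_pron[0]))
    dim = len(ref_timbre[0])
    out = []
    for q in src_pron:
        weights = softmax([dot(q, k) / scale for k in ref_pron])
        out.append([sum(w * v[i] for w, v in zip(weights, ref_timbre))
                    for i in range(dim)])
    return out

def speaker_condition(src_pron, references):
    """references: list of (pron_frames, timbre_frames), one per utterance.
    Frames from all references are pooled so that several utterances
    contribute to both embeddings."""
    ref_pron = [f for pron, _ in references for f in pron]
    ref_timbre = [f for _, timbre in references for f in timbre]
    global_emb = mean_pool(ref_timbre)            # one vector per speaker
    local = local_embeddings(src_pron, ref_pron, ref_timbre)  # one per frame
    # Assumption: combine by concatenating global and local vectors.
    return [global_emb + loc for loc in local]

# Two toy reference utterances with one frame each.
references = [
    ([[10.0, 0.0]], [[1.0, 0.0, 0.0]]),
    ([[0.0, 10.0]], [[0.0, 1.0, 0.0]]),
]
cond = speaker_condition([[10.0, 0.0], [0.0, 10.0]], references)
```

In this toy setup each source frame matches one reference frame almost exactly, so its local part is close to that frame's timbre vector while the global part is the same average for every frame. That mirrors the abstract's motivation: an average embedding alone cannot follow timbre changes across pronunciations.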

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques