How Far Are We from Robust Voice Conversion: A Survey

Tzu-hsien Huang; Jheng-hao Lin; Chien-yu Huang; Hung-yi Lee

arXiv:2011.12063·eess.AS·May 4, 2021

TL;DR

This survey evaluates the robustness of current voice conversion models, highlighting factors affecting performance and proposing modifications to enhance naturalness and resilience against unseen data.

Contribution

The paper provides a comprehensive analysis of voice conversion robustness and introduces modifications like speaker embedding replacements to improve model performance.

Findings

01. Sampling rate and audio duration significantly affect voice conversion quality.

02. All models degrade on unseen data, but AdaIN-VC is relatively more robust than the others.

03. Speaker embeddings trained jointly with the conversion model outperform embeddings trained for speaker identification.

Abstract

Voice conversion technologies have improved greatly in recent years with the help of deep learning, but their ability to produce natural-sounding utterances under different conditions remains unclear. In this paper, we present a thorough study of the robustness of known VC models. We also modify these models, for example by replacing their speaker embeddings, to further improve their performance. We find that sampling rate and audio duration greatly influence voice conversion. All the VC models suffer on unseen data, but AdaIN-VC is relatively more robust. In addition, jointly trained speaker embeddings are more suitable for voice conversion than those trained on speaker identification.

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing