How Far Are We from Robust Voice Conversion: A Survey
Tzu-hsien Huang, Jheng-hao Lin, Chien-yu Huang, Hung-yi Lee

TL;DR
This survey evaluates how robust current voice conversion (VC) models are across different conditions, identifies sampling rate and audio duration as factors that strongly affect conversion quality, and tests modifications, such as replacing the speaker embedding, to improve performance.
Contribution
The paper provides a comprehensive analysis of voice conversion robustness and introduces modifications like speaker embedding replacements to improve model performance.
Findings
Sampling rate and audio duration greatly influence voice conversion (VC) quality.
All evaluated VC models degrade on unseen data, but AdaIN-VC is relatively more robust.
Speaker embeddings trained jointly with the VC model suit voice conversion better than embeddings trained for speaker identification.
Abstract
Voice conversion (VC) technologies have improved greatly in recent years with the help of deep learning, but their ability to produce natural-sounding utterances under different conditions remains unclear. In this paper, we present a thorough study of the robustness of known VC models. We also modify these models, for example by replacing the speaker embeddings, to further improve their performance. We find that the sampling rate and audio duration greatly influence voice conversion. All the VC models suffer on unseen data, but AdaIN-VC is relatively more robust. In addition, jointly trained speaker embeddings are more suitable for voice conversion than those trained on speaker identification.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
