AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
Haeyun Choi, Jio Gim, Yuho Lee, Youngin Kim, and Young-Joo Suh

TL;DR
AutoCycle-VC is a zero-shot cross-lingual voice conversion system that does not depend on a carefully designed bottleneck. It uses a cycle-consistency loss and a global speaker representation to improve synthesis quality.
Contribution
The paper presents a novel cycle-structure-based zero-shot voice conversion method that effectively extracts speaker representations without bottleneck constraints, enabling high-quality cross-lingual conversion.
Findings
Outperforms existing state-of-the-art results in both subjective and objective evaluations.
Enables high-quality cross-lingual voice conversion.
Improves synthesis quality with cycle-consistency loss.
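The cycle-consistency idea can be illustrated with a minimal NumPy sketch. This is not the paper's model. `toy_speaker_embedding` and `toy_convert` are hypothetical stand-ins for the speaker encoder and the conversion network, and the L1 distance is an assumed choice of reconstruction metric. The sketch only shows the shape of the loss: convert source to target, convert back to source, and compare with the original.

```python
import numpy as np


def toy_speaker_embedding(mel):
    # Hypothetical "global" speaker representation: the time-average of all frames.
    return mel.mean(axis=0)


def toy_convert(mel, speaker_emb):
    # Hypothetical converter: remove the utterance's own time-average
    # (treated here as speaker information) and add the requested embedding.
    content = mel - mel.mean(axis=0, keepdims=True)
    return content + speaker_emb


def cycle_consistency_loss(src_mel, src_emb, tgt_emb, convert=toy_convert):
    converted = convert(src_mel, tgt_emb)  # source speaker -> target speaker
    cycled = convert(converted, src_emb)   # target speaker -> back to source
    # Assumed metric: mean absolute (L1) difference to the original input.
    return float(np.mean(np.abs(cycled - src_mel)))


rng = np.random.default_rng(0)
src_mel = rng.normal(size=(50, 80))  # (frames, mel bins), synthetic data
tgt_mel = rng.normal(size=(40, 80))
loss = cycle_consistency_loss(
    src_mel, toy_speaker_embedding(src_mel), toy_speaker_embedding(tgt_mel)
)
```

For this toy converter the round trip is exact, so the loss is numerically zero. A real network would not be exact, and the loss would penalize whatever is lost across the two conversions.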
Abstract
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality because they rely on a carefully designed bottleneck structure. Moreover, models that rely solely on a self-reconstruction loss struggle to reproduce different speakers' voices. To address these issues, we propose a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are used during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to zero-shot conversion. Our model outperforms existing state-of-the-art results in both subjective and objective evaluations. Furthermore, it…
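The two speaker-encoder training tricks named in the abstract can be sketched in NumPy as follows. The details are assumptions, since the abstract does not give them. The number of shuffled copies, stacking along a new leading axis, and the smoothing factor `eps` are illustrative, and both function names are hypothetical. Label smoothing is written in its standard form, (1 - eps) * one_hot + eps / K.

```python
import numpy as np


def stacked_shuffled_mels(mel, n_copies=3, rng=None):
    """Return n_copies frame-shuffled versions of mel, stacked on a new leading axis.

    mel has shape (frames, mel_bins). Shuffling permutes the time axis only,
    so temporal order is destroyed while per-frame spectral content is kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    copies = [mel[rng.permutation(mel.shape[0])] for _ in range(n_copies)]
    return np.stack(copies)


def smooth_labels(speaker_ids, n_speakers, eps=0.1):
    """Standard label smoothing of one-hot speaker targets."""
    one_hot = np.eye(n_speakers)[speaker_ids]
    return one_hot * (1.0 - eps) + eps / n_speakers


mel = np.arange(20, dtype=float).reshape(5, 4)  # synthetic (frames, mel_bins)
batch = stacked_shuffled_mels(mel, n_copies=3, rng=np.random.default_rng(0))
targets = smooth_labels([0, 2], n_speakers=4, eps=0.1)
```

Any statistic that ignores frame order, such as the time-average, is identical for every shuffled copy. That is one way to see why shuffling may push an encoder toward a time-independent representation.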
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Label Smoothing
