CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

TL;DR
This paper introduces CycleGAN-VC3, a non-parallel voice conversion method that extends CycleGAN-VC2 with time-frequency adaptive normalization (TFAN) for mel-spectrogram conversion and outperforms previous CycleGAN-VC models in subjective evaluations.
Contribution
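CycleGAN-VC3 improves mel-spectrogram conversion by integrating time-frequency adaptive normalization (TFAN). This addresses the loss of time-frequency structure that occurs when previous CycleGAN-VC models are applied directly to mel-spectrograms, and it yields improved naturalness and similarity.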
Findings
CycleGAN-VC3 outperforms previous models in subjective evaluations.
Incorporating TFAN improves preservation of time-frequency structure.
CycleGAN-VC3 is effective for both inter-gender and intra-gender voice conversion.
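This summary does not spell out how TFAN works, so the following is only a minimal NumPy sketch of the general idea, assuming the usual adaptive-normalization form: a feature map is instance-normalized, then scaled and shifted element-wise in time and frequency by values derived from the source mel-spectrogram. The function names, the per-channel weights `w_gamma` and `w_beta` (hypothetical stand-ins for the learned layers that would produce the scale and shift), and the assumption that the source is already resized to the feature's time-frequency size are all illustrative and not taken from the paper's implementation.

```python
import numpy as np


def instance_norm(f, eps=1e-5):
    """Normalize each channel of f (channels, freq, time) over freq and time."""
    mean = f.mean(axis=(1, 2), keepdims=True)
    std = f.std(axis=(1, 2), keepdims=True)
    return (f - mean) / (std + eps)


def tfan_sketch(f, x, w_gamma, w_beta):
    """Illustrative time-frequency adaptive normalization.

    f        : (channels, freq, time) intermediate feature map
    x        : (freq, time) source mel-spectrogram, assumed already
               resized to match f's freq and time dimensions
    w_gamma,
    w_beta   : (channels,) hypothetical weights standing in for the
               learned layers that map x to a scale and a shift
    """
    # Scale and shift vary per time-frequency bin because they follow x.
    gamma = w_gamma[:, None, None] * x[None, :, :]
    beta = w_beta[:, None, None] * x[None, :, :]
    return gamma * instance_norm(f) + beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature = rng.normal(size=(4, 8, 16))   # toy feature map
    source = rng.normal(size=(8, 16))       # toy source mel-spectrogram
    out = tfan_sketch(feature, source, np.ones(4), np.zeros(4))
    print(out.shape)
```

The point of the sketch is that the modulation is position-dependent: unlike plain instance normalization, which applies one scale and shift per channel, the scale and shift here carry the source's time-frequency layout into the normalized feature.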
Abstract
Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-spectrogram conversion, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrogram as a conversion target. To address this, we examined the applicability of CycleGAN-VC/VC2 to mel-spectrogram conversion. Through initial experiments, we discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion. To remedy this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
