CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

TL;DR
CycleGAN-VC2 improves non-parallel voice conversion with three techniques: two-step adversarial losses, a 2-1-2D CNN generator, and a PatchGAN discriminator. Compared with CycleGAN-VC, the converted speech is more natural and more similar to the target speaker, which narrows the gap between converted and real target speech.
Contribution
The paper introduces CycleGAN-VC2, an improved voice conversion model that incorporates two-step adversarial losses, a 2-1-2D CNN generator, and a PatchGAN discriminator to enhance speech quality.
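As a rough illustration of how the two-step adversarial losses and the PatchGAN discriminator fit together, the sketch below computes the generator-side adversarial terms for one conversion direction. It assumes the least-squares GAN form listed under Methods, and it assumes each discriminator returns a grid of per-patch scores rather than a single scalar. The function names and the array shapes are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: push real patches to 1 and fake patches to 0.

    Both inputs are arrays of per-patch scores (PatchGAN-style output);
    the loss is averaged over every patch.
    """
    return float(np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2))


def lsgan_g_loss(fake_scores):
    """Least-squares generator loss: push the discriminator's patch scores on fakes to 1."""
    return float(np.mean((fake_scores - 1.0) ** 2))


def two_step_generator_loss(d_y_on_converted, d2_x_on_cycled):
    """Adversarial terms seen by the generators for the path X -> Y -> X.

    d_y_on_converted: patch scores of D_Y on the converted features G_XY(x)
                      (the usual one-step adversarial term).
    d2_x_on_cycled:   patch scores of a second discriminator D'_X on the
                      cyclically reconstructed features G_YX(G_XY(x))
                      (the additional two-step term).
    """
    return lsgan_g_loss(d_y_on_converted) + lsgan_g_loss(d2_x_on_cycled)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 4 x 8 grids of patch scores standing in for discriminator outputs.
    one_step = rng.uniform(0.0, 1.0, size=(4, 8))
    two_step = rng.uniform(0.0, 1.0, size=(4, 8))
    print(two_step_generator_loss(one_step, two_step))
```

The cycle-consistency loss is computed separately; the two-step term adds an adversarial check on the reconstructed features on top of it, and the same pair of terms is evaluated for the opposite direction.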
Findings
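Objective metrics show lower Mel-cepstral distortion and modulation spectra distance than CycleGAN-VC, meaning the converted feature sequences lie closer to the target.
Subjective tests indicate higher naturalness and speaker similarity than CycleGAN-VC across speaker pairs.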
CycleGAN-VC2 outperforms CycleGAN-VC in non-parallel voice conversion tasks.
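For readers unfamiliar with the first objective metric, the following is a minimal sketch of Mel-cepstral distortion (MCD) using its commonly cited definition in dB. It assumes the target and converted Mel-cepstral sequences are already time-aligned and that the energy coefficient has already been dropped; the function name is hypothetical, and the paper's exact evaluation protocol is not stated in this summary.

```python
import numpy as np


def mel_cepstral_distortion(target_mcep, converted_mcep):
    """Mean MCD in dB over frames.

    Both inputs have shape (frames, dims) and are assumed to be
    time-aligned, with the 0th (energy) coefficient already excluded.
    """
    target = np.asarray(target_mcep, dtype=float)
    converted = np.asarray(converted_mcep, dtype=float)
    if target.shape != converted.shape:
        raise ValueError("sequences must be aligned to the same shape")
    diff = target - converted
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values mean the converted Mel-cepstra are closer to the target; identical sequences give 0 dB.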
Abstract
Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, there is still a large gap between the real target and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC task and analyzed the effect of each technique in detail. An…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Residual Connection · Batch Normalization · GAN Least Squares Loss · Cycle Consistency Loss · Sigmoid Activation · Tanh Activation · Residual Block · Convolution
