CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

Takuhiro Kaneko; Hirokazu Kameoka; Kou Tanaka; Nobukatsu Hojo

arXiv:1904.04631 · cs.SD · April 10, 2019 · 6 cites

TL;DR

CycleGAN-VC2 improves non-parallel voice conversion by adding three techniques to CycleGAN-VC: two-step adversarial losses, a 2-1-2D CNN generator, and a PatchGAN discriminator. Together they raise the naturalness and speaker similarity of converted speech and narrow the gap between converted and real target speech.

Contribution

The paper introduces CycleGAN-VC2, an improved voice conversion model that incorporates two-step adversarial losses, a 2-1-2D CNN generator, and a PatchGAN discriminator to enhance speech quality.
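As a rough illustration of how these pieces could fit together on the generator side, the following is a minimal sketch rather than the authors' implementation. It assumes least-squares adversarial losses and an L1 cycle-consistency loss (both appear in the Methods list for this paper), reads "two-step" as a second discriminator that scores the cyclically reconstructed features, and treats each discriminator as returning a list of patch-level scores in the PatchGAN style. All names (`g_xy`, `g_yx`, `d_y`, `d_x2`, `cycle_weight`) and the default weight of 10.0 are hypothetical placeholders, and features are plain lists of floats to keep the example self-contained.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def lsgan_generator_loss(scores):
    """Least-squares adversarial loss for the generator: push scores toward 1."""
    return mean((s - 1.0) ** 2 for s in scores)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares loss for a discriminator: real toward 1, fake toward 0."""
    return mean((s - 1.0) ** 2 for s in real_scores) + mean(s ** 2 for s in fake_scores)


def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature lists."""
    return mean(abs(x - y) for x, y in zip(a, b))


def generator_objective(x, g_xy, g_yx, d_y, d_x2, cycle_weight=10.0):
    """Generator-side objective for one direction (source X -> target Y).

    g_xy, g_yx: hypothetical generators mapping feature lists to feature lists.
    d_y:        discriminator on converted features (one-step adversarial loss).
    d_x2:       assumed second discriminator on cyclically reconstructed
                features (two-step adversarial loss).
    Each discriminator returns a list of patch-level scores.
    """
    y_fake = g_xy(x)          # one-step conversion X -> Y
    x_cycled = g_yx(y_fake)   # two-step (cyclic) conversion X -> Y -> X
    adv_one_step = lsgan_generator_loss(d_y(y_fake))
    adv_two_step = lsgan_generator_loss(d_x2(x_cycled))
    cycle = l1_distance(x, x_cycled)
    return adv_one_step + adv_two_step + cycle_weight * cycle
```

With a perfect cycle and discriminators that score every patch as real, the objective is zero; any patch scored below 1 or any reconstruction error increases it. A full training loop would also need the reverse direction and the discriminator updates, which this sketch omits.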

Findings

1. Objective metrics show reduced Mel-cepstral distortion and modulation spectra distance.

2. Subjective tests indicate higher naturalness and similarity across speaker pairs.

3. CycleGAN-VC2 outperforms CycleGAN-VC in non-parallel voice conversion tasks.
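For readers unfamiliar with the first metric, Mel-cepstral distortion (MCD) is commonly computed per frame as (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) and then averaged over frames. The sketch below follows that common definition. It assumes the two sequences are already time-aligned and that any energy coefficient has been dropped beforehand; the paper's exact evaluation setup is not given on this page, so the function name and input layout here are illustrative only.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    target, converted: lists of frames, each frame a list of mel-cepstral
    coefficients (assumed to exclude the energy term).
    """
    if len(target) != len(converted):
        raise ValueError("sequences must have the same number of frames")
    if not target:
        raise ValueError("sequences must contain at least one frame")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        if len(t_frame) != len(c_frame):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((t - c) ** 2 for t, c in zip(t_frame, c_frame))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(target)
```

Identical sequences give 0 dB, and lower values mean the converted features sit closer to the target features. In practice the sequences usually have different lengths and are aligned first (for example with dynamic time warping), a step this sketch leaves out.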

Abstract

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, there is still a large gap between the real target and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC task and analyzed the effect of each technique in detail. An…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Residual Connection · Batch Normalization · GAN Least Squares Loss · Cycle Consistency Loss · Sigmoid Activation · Tanh Activation · Residual Block · Convolution