MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

Takuhiro Kaneko; Hirokazu Kameoka; Kou Tanaka; Nobukatsu Hojo

arXiv:2102.12841·cs.SD·February 26, 2021·1 cites

TL;DR

MaskCycleGAN-VC introduces a self-supervised auxiliary task, filling in frames (FIF), that improves non-parallel voice conversion by helping the converter capture time-frequency structures without increasing model complexity.

Contribution

The paper proposes a novel auxiliary task, filling in frames (FIF), that enables mel-spectrogram conversion without the additional module and extra learned parameters that CycleGAN-VC3 introduces.

Findings

1. Outperforms CycleGAN-VC2 and CycleGAN-VC3 in naturalness and speaker similarity.
2. Keeps a model size similar to that of CycleGAN-VC2.
3. Learns time-frequency structures through self-supervision.

Abstract

Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and does not extend to mel-spectrogram conversion, despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module called time-frequency adaptive normalization (TFAN), has been proposed. However, this comes at the cost of an increase in the number of learned parameters. As an alternative, we propose MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal…
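
The abstract excerpt is cut off before it finishes describing FIF, so the following is only a minimal sketch of the masking step it begins to describe. It assumes that the temporal mask hides one contiguous run of frames and that the converter receives the mask alongside the masked mel-spectrogram. The function name `apply_temporal_mask`, the `max_mask_len` parameter, the uniform sampling of mask length and position, and the list-of-rows spectrogram layout are all illustrative assumptions, not details taken from the text above.

```python
import random


def apply_temporal_mask(mel, max_mask_len, rng=None):
    """Zero out one contiguous run of frames in a mel-spectrogram.

    mel: list of rows (one per mel bin), each a list of per-frame values.
    Returns (masked_mel, mask). The mask has the same shape as mel and
    holds 0.0 for hidden frames and 1.0 for visible ones.
    """
    if rng is None:
        rng = random.Random()
    n_frames = len(mel[0])
    # Assumed sampling scheme: uniform mask length, uniform start frame.
    mask_len = rng.randint(0, min(max_mask_len, n_frames))
    start = rng.randint(0, n_frames - mask_len)
    frame_mask = [0.0 if start <= t < start + mask_len else 1.0
                  for t in range(n_frames)]
    # The same per-frame mask is applied to every mel bin.
    mask = [list(frame_mask) for _ in mel]
    masked = [[v * m for v, m in zip(row, frame_mask)] for row in mel]
    return masked, mask


# Toy spectrogram: 4 mel bins x 8 frames, all values nonzero.
mel = [[float(b + t) + 1.0 for t in range(8)] for b in range(4)]
masked, mask = apply_temporal_mask(mel, max_mask_len=3, rng=random.Random(0))

# Assumed converter input: masked spectrogram and mask as two "channels".
converter_input = [masked, mask]
```

Under this reading, the converter never sees the hidden frames directly and would have to infer them from the surrounding frames, which is presumably what pushes it to model time-frequency structure. The training objective itself is omitted because the excerpt does not describe it.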

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing