Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Jian Ma; Zhedong Zheng; Hao Fei; Feng Zheng; Tat-seng Chua; Yi Yang

arXiv:2207.06057 · cs.SD · July 28, 2022

Open Access

TL;DR

This paper introduces SGAN-VC, a novel subband-based GAN framework for non-parallel many-to-many voice conversion that improves style similarity and intelligibility without requiring paired data.

Contribution

The paper proposes a new subband-based GAN architecture with style and content encoders, and a pitch-shift module, advancing non-parallel voice conversion methods.
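
The summary above does not say how the subbands are formed. As a rough illustration only, assuming a "subband" is a contiguous slice of the mel-frequency axis, the sketch below splits a mel-spectrogram into equal-width bands and stitches them back together. The function names, the 80-bin spectrogram, and the choice of four bands are hypothetical and are not taken from the paper.

```python
# Minimal sketch: split a mel-spectrogram into frequency subbands.
# Assumption: a subband is a contiguous group of mel bins; this is an
# illustration, not the SGAN-VC implementation.

def split_subbands(mel, num_subbands):
    """Split a mel-spectrogram (list of mel-bin rows, each a list of frames)
    into `num_subbands` contiguous, equal-width frequency bands."""
    n_mels = len(mel)
    if num_subbands <= 0 or n_mels % num_subbands != 0:
        raise ValueError("number of mel bins must be divisible by num_subbands")
    width = n_mels // num_subbands
    return [mel[i * width:(i + 1) * width] for i in range(num_subbands)]


def merge_subbands(subbands):
    """Concatenate subbands along the frequency axis, lowest band first."""
    return [row for band in subbands for row in band]


# Toy input: 80 mel bins x 5 frames, each bin filled with its own index.
mel = [[float(b)] * 5 for b in range(80)]

bands = split_subbands(mel, 4)      # hypothetical choice of four bands
restored = merge_subbands(bands)    # round trip back to the full spectrogram

print([len(band) for band in bands])
```

In a subband-based model, each band could then be processed by its own branch before the outputs are merged; whether SGAN-VC does exactly this is not stated in the text shown here.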

Findings

01. Achieves state-of-the-art results on VCTK and AISHELL3 datasets.

02. Outperforms existing methods in style similarity and intelligibility.

03. Effective on both seen and unseen data.

Abstract

Voice conversion aims to generate new speech that carries the source content in a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the complexity of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, i.e., Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing