Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion

Zhichao Wang; Xinyong Zhou; Fengyu Yang; Tao Li; Hongqiang Du; Lei Xie; Wendong Gan; Haitao Chen; Hai Li

arXiv:2106.08741·eess.AS·June 29, 2021

Open Access

TL;DR

This paper introduces a hybrid prosody modeling approach for source style transfer in non-parallel voice conversion, combining explicit features and implicit latent representations with adversarial training to improve style transfer quality.

Contribution

It proposes a novel hybrid prosody modeling method that integrates explicit and implicit prosody representations within a recognition-synthesis framework for voice conversion.

Findings

01 Outperforms baseline in style transfer quality

02 Maintains speech quality and speaker similarity

03 Effective prosody transfer demonstrated

Abstract

Current voice conversion (VC) methods can successfully convert the timbre of the audio. Because effectively modeling the prosody of the source audio is challenging, transferring the source style to the converted speech remains limited. This study proposes a source style transfer method based on a recognition-synthesis framework. In previous speech generation work, prosody has been modeled either explicitly with prosodic features or implicitly with a latent prosody extractor. In this paper, taking advantage of both, we model prosody in a hybrid manner that combines the explicit and implicit methods in a proposed prosody module. Specifically, prosodic features are used to model prosody explicitly, while a VAE and a reference encoder are used to model prosody implicitly, taking the Mel spectrum and bottleneck features as input, respectively. Furthermore, adversarial training is introduced to…
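
The hybrid idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of log-F0 and energy as explicit features, the latent sizes, the fixed numbers standing in for encoder outputs, and the use of plain concatenation to combine the three representations are all assumptions made here for illustration. Only the VAE reparameterization step and the KL term against a standard normal prior are standard VAE components.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (VAE reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, I), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def hybrid_prosody(explicit_frames, vae_latent, ref_embedding):
    """Append the two utterance-level vectors to every frame of explicit features.

    Concatenation is an assumption made for illustration only.
    """
    return [list(frame) + list(vae_latent) + list(ref_embedding)
            for frame in explicit_frames]


rng = random.Random(0)

# Hypothetical explicit features: one (log-F0, energy) pair per frame.
explicit_frames = [[5.1, 0.30], [5.2, 0.35], [5.0, 0.28]]

# Hypothetical VAE posterior parameters, standing in for an encoder
# that would read the Mel spectrum.
mu, log_var = [0.2, -0.1], [-1.0, -1.5]
z = reparameterize(mu, log_var, rng)

# Hypothetical reference-encoder output, standing in for an encoder
# that would read the bottleneck features.
ref_embedding = [0.4, 0.0, -0.3]

prosody = hybrid_prosody(explicit_frames, z, ref_embedding)
print(len(prosody), len(prosody[0]))  # 3 frames, 7 dimensions each
print(kl_to_standard_normal(mu, log_var))
```

In this sketch the explicit features stay frame-level while the two implicit representations are utterance-level vectors repeated across frames; a real system could equally use frame-level latents or a learned fusion layer.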

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing