Optimizing voice conversion network with cycle consistency loss of speaker identity

Hongqiang Du; Xiaohai Tian; Lei Xie; Haizhou Li

arXiv:2011.08548·cs.SD·November 18, 2020

TL;DR

This paper introduces a training scheme for voice conversion networks that adds a speaker identity loss and a cycle consistency loss to the usual spectral loss, improving speaker similarity in converted speech over baseline methods.

Contribution

It presents a new training approach combining spectral, speaker identity, and cycle consistency losses applicable to any voice conversion model.

Findings

01. Enhanced speaker similarity in converted speech

02. Outperforms baseline methods in speaker similarity on the CMU-ARCTIC and CSTR-VCTK corpora

03. The cycle consistency constraint keeps the speaker identity of the converted speech consistent with the reference speech

Abstract

We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also the speaker identity loss. We introduce a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at the utterance level. While the proposed training scheme is applicable to any voice conversion network, we formulate the study under the average model voice conversion framework in this paper. Experiments conducted on the CMU-ARCTIC and CSTR-VCTK corpora confirm that the proposed method outperforms baseline methods in terms of speaker similarity.
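To make the idea of combining a frame-level spectral loss with an utterance-level speaker identity loss concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the choice of mean squared error and cosine distance, the `weight` parameter, and the `mean_pool_embedding` function (a stand-in for a real speaker encoder) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def frame_spectral_loss(converted, target):
    """Mean squared error over all frames and feature bins (assumed metric)."""
    total, count = 0.0, 0
    for conv_frame, tgt_frame in zip(converted, target):
        for c, t in zip(conv_frame, tgt_frame):
            total += (c - t) ** 2
            count += 1
    return total / count


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors have nonzero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pool_embedding(frames):
    """Hypothetical speaker encoder: averages the frames over time.

    A real system would use a trained speaker embedding network here.
    """
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def total_loss(converted, target, reference, weight=1.0):
    """Spectral loss plus a weighted utterance-level speaker identity term.

    The identity term compares an embedding of the converted speech with an
    embedding of reference speech from the target speaker.
    """
    spectral = frame_spectral_loss(converted, target)
    identity = cosine_distance(
        mean_pool_embedding(converted), mean_pool_embedding(reference)
    )
    return spectral + weight * identity
```

In this sketch the spectral term needs frame-aligned target features, while the identity term only needs some reference utterance from the target speaker, because it compares one pooled vector per utterance rather than individual frames.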
