Optimizing voice conversion network with cycle consistency loss of speaker identity
Hongqiang Du, Xiaohai Tian, Lei Xie, Haizhou Li

TL;DR
This paper introduces a training scheme for voice conversion networks that adds a speaker identity loss and a cycle consistency loss to the usual spectral loss, improving speaker similarity in converted speech over baseline methods.
Contribution
It presents a training approach that combines spectral, speaker identity, and cycle consistency losses and can be applied to any voice conversion network.
Findings
Improved speaker similarity in converted speech
Outperforms baseline methods in speaker similarity on the CMU-ARCTIC and CSTR-VCTK corpora
The cycle consistency loss constrains converted speech to keep the speaker identity of the reference speech at utterance level
Abstract
We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also the speaker identity loss. We introduce a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at utterance level. While the proposed training scheme is applicable to any voice conversion network, we formulate the study under the average model voice conversion framework in this paper. Experiments conducted on the CMU-ARCTIC and CSTR-VCTK corpora confirm that the proposed method outperforms baseline methods in terms of speaker similarity.
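The combination of a frame-level spectral loss with an utterance-level speaker identity term can be illustrated with a minimal sketch. The abstract does not give the loss formulas, so everything specific here is an assumption for illustration: mean squared error as the spectral loss, cosine distance between utterance-level embeddings as the identity term, a toy mean-pooling function standing in for a real speaker encoder, and the hypothetical names `spectral_loss`, `cosine_distance`, `mean_frame_embedding`, `total_loss`, and `identity_weight`.

```python
import math
from typing import Callable, List, Sequence

Frame = Sequence[float]


def spectral_loss(converted: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Frame-level mean squared error (assumed form of the spectral loss)."""
    if not converted or len(converted) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    total, count = 0.0, 0
    for c, t in zip(converted, target):
        if len(c) != len(t):
            raise ValueError("frames must have the same dimension")
        for a, b in zip(c, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity (assumed form of the identity distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def mean_frame_embedding(frames: Sequence[Frame]) -> List[float]:
    """Toy stand-in for a speaker encoder: average the frames over time."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def total_loss(
    converted: Sequence[Frame],
    target: Sequence[Frame],
    reference: Sequence[Frame],
    speaker_encoder: Callable[[Sequence[Frame]], List[float]] = mean_frame_embedding,
    identity_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Spectral loss plus a weighted utterance-level speaker identity term."""
    l_spec = spectral_loss(converted, target)
    # Compare one embedding per utterance: converted speech vs. reference speech.
    l_id = cosine_distance(speaker_encoder(converted), speaker_encoder(reference))
    return l_spec + identity_weight * l_id
```

In this sketch the identity term is computed once per utterance rather than per frame, which mirrors the utterance-level constraint described in the abstract. In a real system the speaker encoder would be a trained network and the losses would be differentiable tensors, not plain Python floats.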
