EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Ziqi Liang; Jianzong Wang; Xulong Zhang; Yong Zhang; Ning Cheng; Jing Xiao

arXiv:2404.19212·cs.SD·May 1, 2024

Open Access

TL;DR

This paper introduces a two-stage self-supervised speech disentanglement model for voice conversion. It uses a mutual information (MI) upper bound estimator (IFUB) and joint text-guided consistent (TGC) learning to separate speech components more cleanly, without human-crafted bottleneck features.

Contribution

The paper proposes a new two-stage model with an MI upper bound estimator and joint text-guided learning for more effective speech disentanglement in voice conversion.

Findings

01 Outperforms the baseline in disentanglement effectiveness

02 Enhances speech naturalness and similarity

03 Reduces timbre leakage

Abstract

Unsupervised disentanglement of speech into content, rhythm, pitch, and timbre for voice conversion has become an active research topic. Existing work generally disentangles speech components through human-crafted bottleneck features, which cannot separate the information sufficiently: pitch and rhythm may still be mixed together. The resulting information overlap reduces speech naturalness. To overcome these limits, we propose a two-stage model that disentangles speech representations in a self-supervised manner without a human-crafted bottleneck design. It uses a purpose-designed Mutual Information (MI) upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech…
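
The abstract does not give the form of the IFUB estimator, so the sketch below does not reproduce it. As a rough illustration of how an MI upper bound can score the overlap between two speech representations, it uses a swapped-in, generic technique: a sampled estimator in the style of CLUB (contrastive log-ratio upper bound) with a linear-Gaussian variational conditional q(y|x). The function name `club_style_upper_bound`, the least-squares fit, and the toy "pitch" and "rhythm" embeddings are all assumptions made for this example. The value behaves as an upper bound only insofar as q(y|x) approximates the true conditional well.

```python
import numpy as np

def club_style_upper_bound(x, y):
    """CLUB-style MI estimate between paired samples x (N, dx) and y (N, dy).

    Hypothetical helper for illustration; this is NOT the paper's IFUB estimator.
    """
    n = x.shape[0]
    # Assumed variational conditional q(y|x) = N(y; [x, 1] @ w, diag(var)),
    # fitted by ordinary least squares instead of a trained network.
    x1 = np.hstack([x, np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(x1, y, rcond=None)
    mu = x1 @ w                                   # (N, dy) predicted means
    var = np.mean((y - mu) ** 2, axis=0) + 1e-6   # (dy,) residual variances

    # log q(y_i | x_i) on matched pairs; the Gaussian normalising constant
    # is identical in both terms and cancels in the difference below.
    positive = -0.5 * np.sum((y - mu) ** 2 / var, axis=1)

    # log q(y_j | x_i) averaged over all j (samples from the marginals).
    diff = y[None, :, :] - mu[:, None, :]         # [i, j] = y_j - mu_i
    negative = -0.5 * np.mean(np.sum(diff ** 2 / var, axis=2), axis=1)

    return float(np.mean(positive - negative))

# Toy embeddings standing in for two speech components (assumed data).
rng = np.random.default_rng(0)
pitch = rng.normal(size=(500, 2))
rhythm_entangled = pitch @ np.array([[1.0, 0.5], [-0.3, 0.8]]) \
    + 0.1 * rng.normal(size=(500, 2))
rhythm_independent = rng.normal(size=(500, 2))

print("entangled  :", club_style_upper_bound(pitch, rhythm_entangled))
print("independent:", club_style_upper_bound(pitch, rhythm_independent))
```

In a disentanglement setup, a quantity of this kind would typically be added to the training loss and minimized, so that one component's representation carries as little information as possible about another's. In practice q(y|x) would usually be a small neural network trained alongside the encoders rather than a closed-form linear fit.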

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing