TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training
Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao

TL;DR
This paper introduces TGAVC, a voice conversion framework that combines text guidance and adversarial training to better disentangle speech content from speaker identity, yielding more natural converted speech with higher speaker similarity.
Contribution
The paper proposes a new autoencoder-based voice conversion method that incorporates text-guided content embedding and adversarial training for enhanced separation of content and speaker features.
Findings
Outperforms AutoVC in naturalness of converted speech
Achieves higher similarity scores in voice conversion
Effectively disentangles content and speaker identity
Abstract
Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling the speaker identity and the speech content using information-constraining bottlenecks. However, due to the pure autoencoder training method, it is difficult to evaluate the separation effect of content and speaker identity. In this paper, a novel voice conversion framework, named Text Guided AutoVC (TGAVC), is proposed to more effectively separate content and timbre from speech, where an expected content embedding produced based on the text transcriptions is designed to guide the extraction of voice content. In addition, adversarial training is applied to eliminate the speaker identity information in the estimated content…
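To make the idea in the abstract concrete, here is a minimal sketch of how an autoencoder reconstruction term, a text-guidance term, and an adversarial speaker term could be combined into one training objective. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of L1 distance for text guidance, the cross-entropy speaker classifier, and the weights `w_text` and `w_adv` are all hypothetical and do not come from the paper.

```python
import math

def mean_squared_error(a, b):
    """Reconstruction term: how far the decoded spectrogram is from the input."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mean_absolute_error(a, b):
    """Text-guidance term (assumed L1): pulls the content embedding
    toward the expected embedding derived from the transcription."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def speaker_cross_entropy(logits, speaker_id):
    """Cross-entropy of a speaker classifier that reads the content embedding."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[speaker_id]

def conversion_objective(mel, mel_hat, content, expected_content,
                         speaker_logits, speaker_id,
                         w_text=1.0, w_adv=0.1):
    """Hypothetical encoder/decoder objective.

    The classifier itself would be trained to minimise the cross-entropy;
    the encoder is rewarded when the classifier fails, so the term is
    subtracted here. The weights are illustrative placeholders.
    """
    rec = mean_squared_error(mel, mel_hat)
    guide = mean_absolute_error(content, expected_content)
    adv = speaker_cross_entropy(speaker_logits, speaker_id)
    return rec + w_text * guide - w_adv * adv
```

In this sketch, a content embedding that carries no speaker information leaves the classifier at chance level, which maximises the subtracted cross-entropy term and therefore lowers the objective.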
