TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Huaizhen Tang; Xulong Zhang; Jianzong Wang; Ning Cheng; Zhen Zeng; Edward Xiao; Jing Xiao

arXiv:2208.04035·cs.SD·August 9, 2022

TL;DR

This paper introduces TGAVC, a novel voice conversion framework that combines text guidance and adversarial training to better disentangle speech content and speaker identity, resulting in improved naturalness and similarity.

Contribution

The paper proposes a new autoencoder-based voice conversion method that incorporates text-guided content embedding and adversarial training for enhanced separation of content and speaker features.

Findings

01 Outperforms AutoVC in naturalness of converted speech

02 Achieves higher similarity scores in voice conversion

03 Effectively disentangles content and speaker identity

Abstract

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder-based method, achieved excellent conversion results by disentangling speaker identity and speech content using information-constraining bottlenecks. However, because it relies on pure autoencoder training, it is difficult to evaluate how well content and speaker identity are separated. In this paper, a novel voice conversion framework, named Text Guided AutoVC (TGAVC), is proposed to more effectively separate content and timbre from speech, where an expected content embedding produced from the text transcriptions is designed to guide the extraction of voice content. In addition, adversarial training is applied to eliminate the speaker identity information in the estimated content…
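The two training signals named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss forms, so everything here is an assumption made for illustration: the guidance term is taken to be a mean absolute distance between the speech-derived content embedding and the text-derived one, the adversarial term is taken to be a speaker classifier's cross-entropy entering the encoder objective with a negative sign, and the function names and weights are hypothetical. The autoencoder reconstruction loss is left out.

```python
import math


def l1_distance(a, b):
    """Mean absolute difference between two equal-length embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def content_encoder_objective(content_emb, text_emb, speaker_logits, speaker_id,
                              guide_weight=1.0, adv_weight=0.1):
    """Hypothetical objective for the content encoder (assumed form).

    guide: pulls the speech-derived content embedding toward the
           text-derived "expected" content embedding.
    adv:   loss of a speaker classifier that reads the content embedding;
           the encoder is rewarded when this loss is high, i.e. when the
           classifier cannot recover the speaker.
    """
    guide = l1_distance(content_emb, text_emb)
    adv = cross_entropy(speaker_logits, speaker_id)
    return guide_weight * guide - adv_weight * adv


# Toy values: a content embedding close to the text embedding, and a
# classifier that is undecided between two speakers.
objective = content_encoder_objective(
    content_emb=[0.2, 0.4], text_emb=[0.2, 0.5],
    speaker_logits=[0.0, 0.0], speaker_id=0,
)
```

In an adversarial setup of this kind the speaker classifier itself would be trained to minimize its cross-entropy, while the encoder is updated against it; the sign flip above is one common way to express that opposition and is not confirmed by the source.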
