Accent and Speaker Disentanglement in Many-to-many Voice Conversion

Zhichao Wang; Wenshuo Ge; Xiong Wang; Shan Yang; Wendong Gan; Haitao Chen; Hai Li; Lei Xie; Xiulin Li

arXiv:2011.08609·cs.SD·November 18, 2020

Open Access

TL;DR

This paper introduces a novel voice and accent joint conversion method that effectively disentangles and recombines speaker and accent information, improving accent conversion quality while maintaining speaker similarity.

Contribution

It presents a new recognition-synthesis framework with two key techniques: accent-dependent recognizers and adversarial training for disentanglement, advancing voice conversion technology.

Findings

01. Enhanced accentedness and audio quality

02. Maintained speaker similarity

03. Outperformed baseline methods

Abstract

This paper proposes a joint voice and accent conversion approach that can convert an arbitrary source speaker's voice to that of a target speaker with a non-native accent. The problem is challenging because each target speaker has training data only in their native accent, so accent and speaker information must be disentangled during conversion-model training and recombined at the conversion stage. Within our recognition-synthesis conversion framework, we address this problem with two techniques. First, we use accent-dependent speech recognizers to obtain bottleneck (BN) features for speakers with different accents, which aims to remove factors other than linguistic information from the BN features used for conversion-model training. Second, we propose adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model.…
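The adversarial training mentioned in the abstract can be illustrated with a minimal sketch of the two opposing objectives. This is not the authors' implementation: the abstract does not say where the adversary attaches or how the losses are weighted. The sketch assumes a speaker classifier that reads the encoder output, and the names `cross_entropy`, `adversarial_objectives`, `recon_loss`, and `lam` are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (logits: list of floats)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def adversarial_objectives(recon_loss, speaker_logits, speaker_labels, lam=0.1):
    """Return (encoder_loss, classifier_loss) for one batch.

    Assumption (not from the paper): a speaker classifier tries to predict
    the speaker from the encoder output, and the encoder is penalised when
    the classifier succeeds. `lam` is a hypothetical trade-off weight.
    """
    # The classifier minimises its own average cross-entropy.
    clf_loss = sum(
        cross_entropy(logits, label)
        for logits, label in zip(speaker_logits, speaker_labels)
    ) / len(speaker_labels)
    # The encoder minimises reconstruction error while maximising the
    # classifier's error, hence the minus sign.
    encoder_loss = recon_loss - lam * clf_loss
    return encoder_loss, clf_loss

# Toy batch: two frames, four candidate speakers, uninformative logits.
enc_loss, clf_loss = adversarial_objectives(
    recon_loss=1.0,
    speaker_logits=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    speaker_labels=[0, 3],
    lam=0.5,
)
```

In a real system the two losses would be minimised alternately, or jointly through a gradient reversal layer, inside an autodiff framework. The sketch only shows how the sign of the classifier term differs between the two players.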

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders