Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network
Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang

TL;DR
This paper introduces a zero-shot, reference-free accent conversion method using a Pseudo Siamese Disentanglement Network, enabling conversion of unseen speakers' speech into target accents while maintaining content and naturalness.
Contribution
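The paper proposes the Pseudo Siamese Disentanglement Network (PSDN), which disentangles accent from the content representation for zero-shot, reference-free accent conversion. It addresses two limitations of previous methods: reliance on reference utterances at inference time and failure to preserve speaker identity.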
Findings
Converted speech shows much higher accentedness than the input.
Maintains naturalness comparable to the original speech.
Works in both directions: foreign-to-native and native-to-foreign conversion.
Abstract
The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity. AC enables a variety of applications, such as language learning, speech content creation, and data augmentation. Previous methods rely on reference utterances in the inference phase or are unable to preserve speaker identity. To address these issues, we propose a zero-shot reference-free accent conversion method, which is able to convert unseen speakers' utterances into a target accent. Pseudo Siamese Disentanglement Network (PSDN) is proposed to disentangle the accent from the content representation. Experimental results show that our model generates speech samples with much higher accentedness than the input and comparable naturalness, on two-way conversion including foreign-to-native and native-to-foreign.
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
