Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

Dongya Jia; Qiao Tian; Kainan Peng; Jiaxin Li; Yuanzhe Chen; Mingbo Ma; Yuping Wang; Yuxuan Wang

arXiv:2212.05751 · eess.AS · August 11, 2023 · Interspeech

Open Access

TL;DR

This paper introduces a zero-shot, reference-free accent conversion method using a Pseudo Siamese Disentanglement Network, enabling conversion of unseen speakers' speech into target accents while maintaining content and naturalness.

Contribution

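The paper proposes the Pseudo Siamese Disentanglement Network (PSDN), a model that disentangles accent from the content representation to enable zero-shot, reference-free accent conversion. It targets two limitations of previous methods: reliance on reference utterances at inference time and failure to preserve speaker identity.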

Findings

01. Converted speech shows much higher accentedness in the target accent than the input speech.

02. Naturalness remains comparable to that of the input speech.

03. Conversion works in both directions: foreign-to-native and native-to-foreign.

Abstract

The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity. AC enables a variety of applications, such as language learning, speech content creation, and data augmentation. Previous methods rely on reference utterances in the inference phase or are unable to preserve speaker identity. To address these issues, we propose a zero-shot reference-free accent conversion method, which is able to convert unseen speakers' utterances into a target accent. Pseudo Siamese Disentanglement Network (PSDN) is proposed to disentangle the accent from the content representation. Experimental results show that our model generates speech samples with much higher accentedness than the input and comparable naturalness, on two-way conversion including foreign-to-native and native-to-foreign.

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques