Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Wenjie Li; Benlai Tang; Xiang Yin; Yushi Zhao; Wei Li; Kang Wang; Hao Huang; Yuxuan Wang; Zejun Ma

arXiv:2005.09271 · cs.CL · May 20, 2020



TL;DR

This paper presents an end-to-end accent conversion system that uses reference encoders and GMM-based attention to enhance speech quality and accent naturalness, while preserving speaker identity.

Contribution

It introduces a novel approach combining native reference speech generation and multi-source information integration for improved accent conversion.

Findings

01. 30% increase in mean opinion score for acoustic quality
02. 68% preference for native accent conversion
03. Retention of speaker voice identity

Abstract

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre. In this paper, we propose approaches to improving accent conversion applicability, as well as quality. First of all, we assume no reference speech is available at the conversion stage, and hence we employ an end-to-end text-to-speech system that is trained on native speech to generate native reference speech. To improve the quality and accent of the converted speech, we introduce reference encoders which make us capable of utilizing multi-source information. This is motivated by acoustic features extracted from native reference and linguistic information, which are complementary to conventional phonetic posteriorgrams (PPGs), so they can be concatenated as features to improve a baseline system based only on PPGs. Moreover, we optimize model architecture…
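The feature concatenation mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the paper's reference encoder is a learned network, whereas `mean_pool` below is a simple stand-in, and the abstract does not say whether the reference information is a fixed-size vector or a frame-aligned sequence. The function names, toy dimensions, and data are all hypothetical.

```python
from typing import List

Frame = List[float]


def mean_pool(frames: List[Frame]) -> Frame:
    """Collapse a variable-length feature sequence into one fixed-size vector.

    Stand-in for a learned reference encoder (an assumption, not the paper's model).
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def concat_conditioning(ppg: List[Frame], reference_embedding: Frame) -> List[Frame]:
    """Append the same reference embedding to every PPG frame."""
    return [frame + reference_embedding for frame in ppg]


# Toy data: 3 PPG frames over 4 phonetic classes,
# and 2 frames of 2-dimensional acoustic features from a native reference.
ppg = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
native_reference = [[1.0, 3.0], [3.0, 5.0]]

ref_vec = mean_pool(native_reference)
features = concat_conditioning(ppg, ref_vec)
print(len(features), len(features[0]))  # prints "3 6"
```

Each output frame keeps its 4 PPG values and gains the 2 reference values, so a downstream model sees both sources of information at every time step.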

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research