AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Yongmao Zhang; Zhichao Wang; Peiji Yang; Hongshen Sun; Zhisheng Wang; Lei Xie

arXiv:2210.17305·cs.SD·November 1, 2022·1 cites

Open Access

TL;DR

AccentSpeech introduces a three-stage TTS approach that leverages high-quality target speaker data and a novel BN-to-BN transfer module to synthesize accented speech with improved quality and prosody, despite the noise and poor acoustic quality of crowd-sourced recordings.

Contribution

The paper proposes a new three-stage accent transfer method using bottleneck features and a BN-to-BN module, enhancing accent transfer quality and prosody in TTS systems.

Findings

01 Effective accent transfer with good prosody achieved

02 Robustness to crowd-sourced data noise demonstrated

03 Improved speech quality in Mandarin TTS accent transfer

Abstract

Learning accent from crowd-sourced data is a feasible way to achieve a target speaker TTS system that can synthesize accented speech. To this end, there are two challenging problems to be solved. First, directly using the crowd-sourced data, which has poor acoustic quality, together with the target speaker data in accent transfer will lead to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module to learn accent and a BN-to-Mel (BN2Mel) module to learn speaker timbre, where the neural-network-based BN features serve as an intermediate representation that is robust to noise interference. Second, directly training T2BN on the crowd-sourced data in the two-stage system will produce accented speech of the target speaker with poor prosody. This is because the crowd-sourced recordings…
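The two-module decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names `TextToBN` and `BNToMel`, the dimensions, the fixed per-token duration, and the random lookup table and projection are all assumptions chosen only to show how accent (T2BN) and timbre (BN2Mel) are handled by separate modules that meet at the BN features.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
BN_DIM = 8            # size of one bottleneck (BN) feature frame
MEL_DIM = 4           # number of mel bins per output frame
FRAMES_PER_TOKEN = 3  # toy fixed duration for every input token


def make_matrix(rows, cols, seed):
    """Build a reproducible random matrix as a list of lists."""
    r = random.Random(seed)
    return [[r.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class TextToBN:
    """Toy stand-in for T2BN: token ids -> BN frames (where accent would be learned)."""

    def __init__(self, vocab_size, seed=0):
        self.table = make_matrix(vocab_size, BN_DIM, seed)

    def __call__(self, token_ids):
        frames = []
        for token in token_ids:
            # Repeat each token's BN vector for a fixed number of frames.
            frames.extend(list(self.table[token]) for _ in range(FRAMES_PER_TOKEN))
        return frames


class BNToMel:
    """Toy stand-in for BN2Mel: BN frames -> mel frames (where timbre would be learned)."""

    def __init__(self, seed=1):
        self.weights = make_matrix(BN_DIM, MEL_DIM, seed)

    def __call__(self, bn_frames):
        # Plain linear projection of every BN frame to a mel frame.
        return [
            [sum(f[i] * self.weights[i][j] for i in range(BN_DIM)) for j in range(MEL_DIM)]
            for f in bn_frames
        ]


def synthesize(token_ids, t2bn, bn2mel):
    """Chain the two modules; BN features are the only interface between them."""
    return bn2mel(t2bn(token_ids))


accent_t2bn = TextToBN(vocab_size=20)  # hypothetically trained on accent data
target_bn2mel = BNToMel()              # hypothetically trained on target speaker data
mel = synthesize([3, 7, 12], accent_t2bn, target_bn2mel)
print(len(mel), len(mel[0]))
```

Because the modules only share the BN representation, each one could in principle be trained on a different corpus, which is the property the abstract relies on.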

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing