AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents
Yongmao Zhang, Zhichao Wang, Peiji Yang, Hongshen Sun, Zhisheng Wang, Lei Xie

TL;DR
AccentSpeech introduces a three-stage TTS approach that leverages high-quality target speaker data and a novel BN-to-BN transfer module to synthesize accented speech with improved quality and prosody, overcoming noise and data quality issues.
Contribution
The paper proposes a new three-stage accent transfer method using bottleneck features and a BN-to-BN module, enhancing accent transfer quality and prosody in TTS systems.
Findings
Effective accent transfer with good prosody achieved
Robustness to crowd-sourced data noise demonstrated
Improved speech quality in Mandarin TTS accent transfer
Abstract
Learning accent from crowd-sourced data is a feasible way to achieve a target speaker TTS system that can synthesize accented speech. To this end, there are two challenging problems to be solved. First, directly using crowd-sourced data of poor acoustic quality together with the target speaker data in accent transfer will apparently lead to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module to learn accent and a BN-to-Mel (BN2Mel) module to learn speaker timbre, where neural network based BN features serve as an intermediate representation that is robust to noise interference. Second, directly training T2BN using the crowd-sourced data in the two-stage system will produce accented speech of the target speaker with poor prosody. This is because the crowd-sourced recordings…
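The decomposition described in the abstract can be pictured as a chain of stages that pass bottleneck features between them. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names (`text_to_bn`, `bn_to_bn`, `bn_to_mel`, `make_pipeline`), the feature dimensions, and the toy arithmetic inside each stage are hypothetical stand-ins for trained neural networks, and placing the BN-to-BN module between T2BN and BN2Mel is an assumption about how the three stages connect.

```python
from typing import Callable, List, Sequence

Frames = List[List[float]]  # a sequence of per-frame feature vectors

BN_DIM = 4   # hypothetical bottleneck feature size
MEL_DIM = 8  # hypothetical number of mel bins


def make_pipeline(*stages: Callable) -> Callable:
    """Compose stages so that each one consumes the previous one's output."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run


def text_to_bn(phonemes: Sequence[str]) -> Frames:
    # Toy stand-in for T2BN: emits one BN frame per phoneme.
    return [[float(len(p))] * BN_DIM for p in phonemes]


def bn_to_bn(bn: Frames) -> Frames:
    # Toy stand-in for the BN-to-BN module: maps BN frames to BN frames,
    # keeping the frame count and dimension unchanged.
    return [[v + 0.5 for v in frame] for frame in bn]


def bn_to_mel(bn: Frames) -> Frames:
    # Toy stand-in for BN2Mel: maps each BN frame to a mel frame.
    return [[sum(frame) / len(frame)] * MEL_DIM for frame in bn]


# Two-stage system from the abstract: text -> BN -> mel.
two_stage = make_pipeline(text_to_bn, bn_to_mel)

# Three-stage variant with an assumed BN-to-BN step in the middle.
three_stage = make_pipeline(text_to_bn, bn_to_bn, bn_to_mel)

mel = three_stage(["n", "i", "h", "ao"])
print(len(mel), len(mel[0]))  # 4 frames, 8 mel bins each
```

Because every stage only exchanges BN features, each stand-in could in principle be trained on a different dataset, which is the property the abstract relies on to keep noisy crowd-sourced audio away from the module that learns speaker timbre.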
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
