Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment
Tien-Hong Lo, Meng-Ting Tsai, Yao-Ting Sung, Berlin Chen

TL;DR
This paper proposes a systematic framework utilizing zero-shot text-to-speech to generate learner-specific golden speech for improved pronunciation assessment in second language learning, demonstrating significant performance gains.
Contribution
It introduces a novel framework for assessing synthesis models' ability to generate golden speech and explores its effectiveness in automatic pronunciation assessment, a first in this domain.
Findings
Significant improvements in assessment metrics on the L2-ARCTIC and Speechocean762 benchmark datasets.
First exploration of golden speech in ZS-TTS and APA.
Potential for enhanced computer-assisted pronunciation training.
Abstract
Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when that speech aligns with their own speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) in-depth investigations of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance…
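The idea of using golden speech as a reference for measuring pronunciation can be illustrated with a minimal sketch. This is not the paper's scoring model, which the abstract does not describe. It only assumes that the learner's utterance and the ZS-TTS golden utterance have each been converted to a sequence of frame-level feature vectors, and it compares the two with dynamic time warping (DTW), a standard alignment technique substituted here for illustration. The function name dtw_distance and the random placeholder features are hypothetical.

```python
import numpy as np


def dtw_distance(learner: np.ndarray, golden: np.ndarray) -> float:
    """Length-normalized DTW cost between two (frames x dims) feature sequences.

    Hypothetical helper for illustration only; lower values mean the learner's
    utterance is closer to the golden speech under this simple measure.
    """
    n, m = len(learner), len(golden)
    # Euclidean distance between every learner frame and every golden frame.
    cost = np.linalg.norm(learner[:, None, :] - golden[None, :, :], axis=-1)
    # acc[i, j] holds the cheapest cumulative cost of aligning the first i
    # learner frames with the first j golden frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # learner frame repeated against golden
                acc[i, j - 1],      # golden frame repeated against learner
                acc[i - 1, j - 1],  # both sequences advance together
            )
    return float(acc[n, m] / (n + m))


# Placeholder features standing in for real acoustic features (assumption).
rng = np.random.default_rng(0)
golden_feats = rng.normal(size=(40, 13))
learner_feats = golden_feats + 0.1 * rng.normal(size=(40, 13))
print(dtw_distance(learner_feats, golden_feats))
```

Because DTW tolerates differences in speaking rate, a slower rendition of the same frames is not penalized, so the distance reflects how the sounds differ and not how long they last. A real APA system would typically feed such comparisons, or learned representations of both utterances, into a trained scoring model.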
Taxonomy
Topics: Speech Recognition and Synthesis · Employee Welfare and Language Studies · Phonetics and Phonology Research
Methods: Adaptive Pseudo Augmentation
