Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment

Tien-Hong Lo; Meng-Ting Tsai; Yao-Ting Sung; Berlin Chen

arXiv:2409.07151·eess.AS·July 29, 2025

Open Access

TL;DR

This paper proposes a systematic framework utilizing zero-shot text-to-speech to generate learner-specific golden speech for improved pronunciation assessment in second language learning, demonstrating significant performance gains.

Contribution

The paper introduces a systematic framework for assessing a synthesis model's ability to generate golden speech, and investigates how effective that golden speech is for automatic pronunciation assessment (APA), which the authors present as a first in this domain.

Findings

01

Significant improvements in assessment metrics on the L2-ARCTIC and Speechocean762 benchmark datasets.

02

First exploration of golden speech in zero-shot text-to-speech (ZS-TTS) and automatic pronunciation assessment (APA).

03

Potential for enhanced computer-assisted pronunciation training.

Abstract

Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when that speech aligns with their own speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can serve as an effective reference for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) the design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) in-depth investigations of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance…
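To make the idea of golden speech as a reference more concrete, the following is a minimal sketch and not the method used in the paper. It assumes that the learner's utterance and the golden utterance have each already been converted to a sequence of frame-level feature vectors (for example, MFCCs), and it uses dynamic time warping (DTW) as an illustrative stand-in for whatever comparison the authors actually apply. The function name `dtw_distance` and the toy feature values are hypothetical.

```python
import math


def dtw_distance(learner, golden):
    """Length-normalised DTW distance between two feature sequences.

    Each argument is a list of equal-dimension feature vectors (tuples of
    floats). A smaller value means the learner's utterance is closer to the
    golden reference. This is an illustrative proxy, not the paper's metric.
    """
    n, m = len(learner), len(golden)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of learner[:i] with golden[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(learner[i - 1], golden[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # learner frame repeated against golden
                cost[i][j - 1],      # golden frame repeated against learner
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m] / (n + m)


# Toy 2-dimensional "features" (hypothetical values, for illustration only).
golden = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
slow_but_accurate = [(0.0, 1.0), (0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
inaccurate = [(5.0, 5.0), (6.0, 5.0), (7.0, 5.0)]

print(dtw_distance(slow_but_accurate, golden))
print(dtw_distance(inaccurate, golden))
```

Because DTW tolerates differences in speaking rate, the slower but otherwise matching utterance stays close to the reference, while the mismatched one is penalised. A real APA system would map such a distance (or learned features derived from both signals) to proficiency scores.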

Taxonomy

Topics: Speech Recognition and Synthesis · Employee Welfare and Language Studies · Phonetics and Phonology Research

Methods: Adaptive Pseudo Augmentation