Goodness-of-pronunciation without phoneme time alignment

Jeremy H. M. Wong; Nancy F. Chen

arXiv:2603.25150 · cs.CL · March 27, 2026

Open Access

TL;DR

This paper introduces a speech evaluation method that extracts features from weakly-supervised ASR outputs and combines them with a cross-attention architecture, removing the need for phoneme time alignment and easing the expansion of speech evaluation to low-resource languages.

Contribution

The paper proposes extracting phoneme-level and frame-level features from weakly-supervised models without phoneme time alignment, which eases speech evaluation in low-resource languages.

Findings

01 Performs comparably with standard features on English datasets

02 Effective on low-resource Tamil speech datasets

03 Eliminates the need for phoneme time alignment

Abstract

In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level, rather than phoneme-level, speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with…
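
To illustrate how cross-attention can combine phoneme-level and frame-level features without time boundaries, here is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the paper's architecture, which the truncated abstract does not specify. The function names, the dimensions, and the random stand-in features are all assumptions made for illustration, and the sketch assumes both feature streams have already been projected to the same dimension. Each phoneme acts as a query over all frames, so the attention weights act as a soft alignment that replaces hard phoneme time boundaries.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: one vector per phoneme (hypothetical phoneme-level features)
    keys, values: one vector per frame (hypothetical frame-level features)
    Returns one pooled vector per phoneme, plus the attention weights.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this phoneme to every frame; no time alignment is used.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of frame values, one output dimension at a time.
        pooled_vec = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(pooled_vec)
        all_weights.append(weights)
    return outputs, all_weights


# Random stand-in features (assumption: both streams share dimension DIM).
random.seed(0)
NUM_PHONEMES, NUM_FRAMES, DIM = 4, 30, 8
phoneme_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PHONEMES)]
frame_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]

pooled, weights = cross_attention(phoneme_feats, frame_feats, frame_feats)
```

Each row of `weights` sums to one across the frames, and `pooled` holds one frame-informed vector per phoneme that a downstream scorer could consume. A trained model would normally add learned query, key, and value projections, which are omitted here for brevity.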

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing