Goodness-of-pronunciation without phoneme time alignment
Jeremy H. M. Wong, Nancy F. Chen

TL;DR
This paper introduces a novel method for speech evaluation that combines weakly-supervised ASR outputs with a cross-attention architecture, eliminating the need for phoneme time alignment and enabling effective evaluation in low-resource languages.
Contribution
The paper proposes a way to extract phoneme-level and frame-level features without phoneme time alignment, so that weakly-supervised models can be used for speech evaluation in low-resource languages.
Findings
Performs comparably with standard features on English datasets
Effective on low-resource Tamil speech datasets
Eliminates the need for phoneme time alignment
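To illustrate why cross-attention can remove the need for phoneme time alignment, here is a minimal sketch in plain Python. It is not the authors' architecture: the single attention head, the dimensions, the random projection matrices `w_q`, `w_k`, `w_v`, and the helper names are all assumptions made for illustration. The point it shows is that each phoneme forms a query over every frame, so a soft, learned weighting over frames takes the place of explicit phoneme time boundaries.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(phoneme_feats, frame_feats, w_q, w_k, w_v):
    """Phonemes (queries) attend over frames (keys and values).

    phoneme_feats: P rows, one per phoneme; frame_feats: T rows, one per frame.
    Returns one context vector per phoneme and the P x T attention weights.
    """
    q = matmul(phoneme_feats, w_q)            # P x d
    k = matmul(frame_feats, w_k)              # T x d
    v = matmul(frame_feats, w_v)              # T x d
    k_t = [list(col) for col in zip(*k)]      # d x T
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                   # P x T
    weights = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(weights, v)              # P x d
    return context, weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or extracted features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_phonemes, num_frames = 4, 50              # illustrative sizes only
phoneme_dim, frame_dim, attn_dim = 8, 16, 12

phoneme_feats = random_matrix(num_phonemes, phoneme_dim, rng)
frame_feats = random_matrix(num_frames, frame_dim, rng)
w_q = random_matrix(phoneme_dim, attn_dim, rng)
w_k = random_matrix(frame_dim, attn_dim, rng)
w_v = random_matrix(frame_dim, attn_dim, rng)

context, weights = cross_attention(phoneme_feats, frame_feats, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over all frames for one phoneme. In a trained model those rows would be learned from the evaluation objective; here they are random and only demonstrate the shapes involved.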
Abstract
In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level rather than phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with…
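The step of mapping ASR hypotheses to a phoneme confusion network can be sketched roughly as follows. This is a simplified illustration, not the paper's procedure: it assumes a weighted N-best list of word hypotheses, a hypothetical pronunciation lexicon `LEXICON`, and a confusion network whose slots are the positions of the canonical phoneme sequence, filled by edit-distance alignment. The hypothesis weights and the `EPS` deletion symbol are likewise made up for the example.

```python
EPS = "<eps>"  # assumed symbol for "no phoneme aligned to this slot"

# Hypothetical lexicon mapping words to phoneme sequences.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cut": ["k", "ah", "t"],
    "cap": ["k", "ae", "p"],
}


def align_to_reference(ref, hyp):
    """Levenshtein-align hyp to ref; return the hyp phoneme at each ref slot."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    aligned = [EPS] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            aligned[i - 1] = hyp[j - 1]       # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1                            # deletion: slot keeps EPS
        else:
            j -= 1                            # insertion: ignored in this sketch
    return aligned


def slot_posteriors(ref, hyps):
    """hyps: list of (phoneme_list, weight). Returns one posterior dict per slot."""
    total = sum(weight for _, weight in hyps)
    slots = [dict() for _ in ref]
    for phones, weight in hyps:
        for slot, phone in zip(slots, align_to_reference(ref, phones)):
            slot[phone] = slot.get(phone, 0.0) + weight / total
    return slots


# Illustrative N-best list: (word hypothesis, weight). Values are invented.
nbest = [("cat", 0.6), ("cut", 0.3), ("cap", 0.1)]
reference = LEXICON["cat"]                    # canonical pronunciation

hyps = [(LEXICON[word], weight) for word, weight in nbest]
slots = slot_posteriors(reference, hyps)

# Posterior of the canonical phoneme in each slot, usable as a GOP-like feature.
canonical_posteriors = [slot.get(p, 0.0) for slot, p in zip(slots, reference)]
```

With these invented weights, the canonical phoneme posteriors come out at roughly 1.0, 0.7, and 0.9 for `k`, `ae`, and `t`. No frame-level time boundary is consulted at any point, which is the property the abstract relies on.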
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
