Segmentation-free Goodness of Pronunciation
Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

TL;DR
This paper introduces segmentation-free goodness of pronunciation (GOP) methods that leverage CTC-trained acoustic models for improved mispronunciation detection and diagnosis in language learning, achieving state-of-the-art results.
Contribution
It proposes a segmentation-free GOP approach (GOP-SF) that takes all possible segmentations of the canonical transcription into account and normalizes for the peakiness of the acoustic model's output, enabling pronunciation assessment without pre-segmentation.
Findings
GOP-SF outperforms segmentation-based methods on CMU Kids and speechocean762 datasets.
The method is robust to variations in model peakiness and context.
State-of-the-art phoneme-level pronunciation assessment results are achieved.
Abstract
Mispronunciation detection and diagnosis (MDD) is a significant part of modern computer-aided language learning (CALL) systems. Most systems implementing phoneme-level MDD through goodness of pronunciation (GOP), however, rely on pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method that takes all possible segmentations of the canonical transcription into account (GOP-SF). We give a theoretical account of our definition of GOP-SF, an implementation that solves potential numerical issues, and a proper normalization that allows the use of acoustic models with different peakiness over time. We…
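The idea of taking all possible segmentations into account can be illustrated with the standard CTC forward algorithm, which sums the probability of every frame-level alignment of a label sequence. The sketch below is only an illustration of that summation in the log domain (one common way to avoid numerical underflow); it is not the paper's GOP-SF definition or its normalization. The function name `ctc_log_likelihood`, the blank index 0, and the toy posteriors are assumptions made for this example.

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """Return log P(labels | frames), summed over all CTC alignments.

    log_probs: array of shape (T, V) with per-frame log posteriors.
    labels:    list of label ids without blanks (hypothetical canonical phones).
    blank:     index of the CTC blank symbol (assumed to be 0 here).
    """
    T = log_probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)

    # alpha[s] is the log probability of all partial alignments ending in ext[s]
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    for t in range(1, T):
        prev = alpha
        alpha = np.full(S, -np.inf)
        for s in range(S):
            cands = [prev[s]]                      # stay on the same symbol
            if s >= 1:
                cands.append(prev[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(prev[s - 2])          # skip the blank between two different labels
            alpha[s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]

    # Valid alignments end on the last label or on the trailing blank
    tail = alpha[-2:] if S > 1 else alpha[-1:]
    return float(np.logaddexp.reduce(tail))

# Toy example: 2 frames, 3 symbols (blank + 2 phones), uniform posteriors
toy = np.log(np.full((2, 3), 1.0 / 3.0))
score = ctc_log_likelihood(toy, [1])
```

In the toy example, the single label has three valid alignments over two frames (label-label, blank-label, label-blank), so no single segmentation has to be chosen in advance. A segmentation-free score could then be built by comparing such likelihoods for the canonical transcription against alternatives, but how the paper does this is not specified in the text above.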
