Weakly-supervised word-level pronunciation error detection in non-native English speech
Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

TL;DR
This paper introduces a weakly-supervised model for detecting word-level pronunciation errors in non-native English speech. It needs no phonetic transcriptions of non-native speech and improves AUC by 30% on the GUT Isle Corpus and by 21.5% on the Isle Corpus.
Contribution
It presents a novel multi-task learning approach that leverages phoneme recognizers trained on native speech to detect mispronunciations without detailed phonetic annotations.
Findings
30% improvement in AUC on GUT Isle Corpus for Polish speakers
21.5% improvement in AUC on Isle Corpus for German and Italian speakers
Effective detection of pronunciation errors without phonetic transcriptions
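For readers unfamiliar with the metric behind these numbers, here is a minimal sketch of one common AUC definition, the area under the ROC curve, computed as the probability that a mispronounced word receives a higher score than a correctly pronounced one. This summary does not say which curve the reported AUC refers to, so the ROC reading is an assumption, and the function name `roc_auc` and the toy labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a mispronounced word, 0 for a correctly pronounced word.
    scores: model scores, higher meaning "more likely mispronounced".
    Assumption: ROC AUC; the source does not specify the curve it uses.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the paper.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of mispronounced from correctly pronounced words.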
Abstract
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to the limited amount of mispronounced L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech that is easily accessible and can be automatically annotated. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in AUC…
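The multi-task setup described in the abstract can be sketched as a weighted sum of two losses: one for word-level mispronunciation probabilities and one for phoneme recognition. This is only an illustration under stated assumptions, not the authors' implementation. The per-frame cross-entropy stands in for whatever phoneme recognition loss the paper uses, and the weight `alpha`, the function names, and the toy inputs are all hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(p, y):
    """Loss for one word: p is the predicted mispronunciation probability, y is 0 or 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(probs, target_index):
    """Loss for one frame: probs is a distribution over phonemes."""
    return -math.log(max(probs[target_index], EPS))


def multi_task_loss(word_probs, word_labels, phoneme_probs, phoneme_targets, alpha=0.5):
    """Combine the two tasks into one training objective.

    Task 1 (word level): uses only word-level mispronunciation marks.
    Task 2 (phoneme level): uses phonetically transcribed speech.
    alpha is a hypothetical weight; the source does not give one.
    """
    word_loss = sum(
        binary_cross_entropy(p, y) for p, y in zip(word_probs, word_labels)
    ) / len(word_labels)
    phoneme_loss = sum(
        categorical_cross_entropy(d, t) for d, t in zip(phoneme_probs, phoneme_targets)
    ) / len(phoneme_targets)
    return word_loss + alpha * phoneme_loss


# Toy inputs, not taken from the paper.
loss = multi_task_loss(
    word_probs=[0.5],
    word_labels=[1],
    phoneme_probs=[[0.5, 0.5]],
    phoneme_targets=[0],
    alpha=0.5,
)
```

The auxiliary phoneme term acts as a regularizer here: it gives the shared model a richer training signal than the sparse word-level marks alone, which is the overfitting concern the abstract raises.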
