Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need
Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek

TL;DR
This paper introduces innovative speech synthesis techniques to generate synthetic non-native speech, significantly improving pronunciation error detection accuracy in computer-assisted pronunciation training.
Contribution
It presents three novel speech generation methods, phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion, that improve pronunciation error detection models and set a new state of the art in CAPT.
Findings
The S2S technique improves error detection AUC by 41% compared to the previous state-of-the-art approach
Synthetic speech generation enhances pronunciation error detection accuracy
Achieved new state-of-the-art in CAPT error detection metrics
Abstract
The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P),…
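To make the idea of generating mispronounced training data concrete, here is a minimal sketch of a phoneme-level perturbation in the spirit of a phoneme-to-phoneme approach. It is an illustration under stated assumptions, not the authors' implementation: the phoneme inventory `PHONEMES`, the function name `perturb_phonemes`, the `error_rate` parameter, and the uniform choice among substitution, deletion, and insertion are all hypothetical.

```python
import random

# Hypothetical, tiny phoneme inventory (ARPAbet-style symbols), for illustration only.
PHONEMES = ["AA", "AE", "AH", "IY", "IH", "UW", "B", "D", "K", "S", "T", "TH", "Z"]


def perturb_phonemes(canonical, error_rate=0.3, seed=None):
    """Simulate mispronunciations on a canonical phoneme sequence.

    Returns (perturbed, labels), where labels[i] is 1 if canonical[i]
    was substituted, deleted, or followed by an inserted phoneme.
    """
    rng = random.Random(seed)
    perturbed, labels = [], []
    for ph in canonical:
        if rng.random() >= error_rate:
            # Keep the phoneme unchanged and mark it as correctly pronounced.
            perturbed.append(ph)
            labels.append(0)
            continue
        labels.append(1)
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute":
            perturbed.append(rng.choice([p for p in PHONEMES if p != ph]))
        elif op == "insert":
            perturbed.append(ph)
            perturbed.append(rng.choice(PHONEMES))
        # "delete": emit nothing for this phoneme.
    return perturbed, labels


if __name__ == "__main__":
    canonical = ["K", "AE", "T"]  # canonical pronunciation of "cat"
    print(perturb_phonemes(canonical, error_rate=0.5, seed=0))
```

Each call yields a perturbed sequence plus per-phoneme error labels, which is the kind of labeled pair an error detection model could be trained on. Perturbing phonemes alone does not produce audio, so a sketch like this would still need a speech synthesis step to yield synthetic speech.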
