L1-aware Multilingual Mispronunciation Detection Framework
Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

TL;DR
This paper presents a multilingual mispronunciation detection framework that incorporates L1-aware speech representations, improving accuracy and robustness across multiple languages and datasets.
Contribution
The paper introduces L1-MultiMDD, a novel end-to-end multilingual mispronunciation detection architecture with L1-aware embeddings and multi-task training, enhancing detection performance.
Findings
Significant reduction in phoneme error rate (PER) across languages.
Improved false rejection rate (FRR) demonstrating robustness.
Effective generalization to unseen datasets.
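The two metrics named in the findings can be illustrated with a minimal sketch. It assumes the definitions commonly used in the mispronunciation detection literature: PER as the phoneme-level edit distance divided by the reference length, and FRR as the share of correctly pronounced phonemes that the system flags as errors. The summary does not confirm that the paper computes them exactly this way, and the function names and example phonemes are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes (assumed definition)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def false_rejection_rate(true_accepts, false_rejects):
    """FRR = FR / (TA + FR), counted over correctly pronounced phonemes
    (assumed definition)."""
    total = true_accepts + false_rejects
    if total == 0:
        raise ValueError("no correctly pronounced phonemes to evaluate")
    return false_rejects / total


# Hypothetical example: one substitution and one deletion against 5 phonemes.
reference = ["DH", "AH", "K", "AE", "T"]
recognized = ["D", "AH", "K", "AE"]
per = phoneme_error_rate(reference, recognized)
frr = false_rejection_rate(true_accepts=90, false_rejects=10)
```

A lower PER indicates more accurate phoneme recognition, and a lower FRR means fewer correct pronunciations are wrongly flagged.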
Abstract
The phonological discrepancies between a speaker's native language (L1) and the non-native language (L2) serve as a major factor in mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representations. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2 speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup to identify the L1 and L2 languages, and are infused into the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate…
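As a rough illustration of how a CTC-trained phoneme recognizer turns frame-level outputs into a phoneme sequence, the sketch below applies standard greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. This is a generic technique, not the paper's inference code; the label set, the scores, and the function name are hypothetical.

```python
def ctc_greedy_decode(frame_scores, labels, blank="<blank>"):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label for every frame.
    best = [labels[max(range(len(row)), key=row.__getitem__)]
            for row in frame_scores]
    decoded = []
    prev = None
    for sym in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if sym != prev and sym != blank:
            decoded.append(sym)
        prev = sym
    return decoded


# Hypothetical label inventory and per-frame scores (one row per frame).
labels = ["<blank>", "K", "AE", "T"]
frames = [
    [0.1, 0.7, 0.1, 0.1],    # K
    [0.2, 0.6, 0.1, 0.1],    # K (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # AE
    [0.1, 0.1, 0.7, 0.1],    # AE (repeat, merged)
    [0.9, 0.05, 0.03, 0.02], # blank
    [0.1, 0.1, 0.1, 0.7],    # T
]
phonemes = ctc_greedy_decode(frames, labels)
```

In an MDD setting, the decoded sequence would typically be compared against the reference phoneme sequence to locate mismatches; that comparison step is an assumption here and is not detailed in the abstract.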
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: ALIGN
