End-to-End Mispronunciation Detection and Diagnosis From Raw Waveforms
Bi-Cheng Yan, Berlin Chen

TL;DR
This paper introduces an end-to-end neural model for mispronunciation detection that processes raw speech waveforms using SincNet, achieving comparable performance to traditional methods while improving interpretability and adaptability across different non-native speakers.
Contribution
The study presents a novel E2E MDD model utilizing SincNet to directly process raw waveforms, reducing parameters and enhancing interpretability compared to conventional CNN-based features.
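The parameter saving can be illustrated with a minimal NumPy sketch of a SincNet-style band-pass filter. This is not the authors' implementation: the function names, the kernel size of 251, the 16 kHz sample rate, and the Hamming window are all assumptions chosen for illustration. The point it shows is that each filter is fully determined by two cutoff frequencies, whereas a standard convolutional filter of the same length would learn one weight per tap.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=251, sample_rate=16000):
    """Windowed-sinc band-pass filter defined by only two cutoff frequencies.

    kernel_size and sample_rate are illustrative assumptions.
    """
    # Time axis in samples, centred on zero so the filter is symmetric.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs as normalised frequencies (cycles per sample).
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # The difference of two ideal low-pass filters is a band-pass filter.
    # np.sinc is the normalised sinc, sin(pi x) / (pi x).
    g = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window (assumed here) smooths the truncation of the sinc.
    return g * np.hamming(kernel_size)

def sinc_layer(waveform, cutoffs_hz, kernel_size=251, sample_rate=16000):
    """Apply one band-pass filter per (low, high) pair to a raw waveform."""
    filters = [sinc_bandpass(lo, hi, kernel_size, sample_rate)
               for lo, hi in cutoffs_hz]
    # One output channel per filter; "valid" keeps only full overlaps.
    return np.stack([np.convolve(waveform, h, mode="valid") for h in filters])

# Usage: a 1 kHz tone passes the 500-1500 Hz band and is rejected by 3-5 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
features = sinc_layer(tone, [(500.0, 1500.0), (3000.0, 5000.0)])
```

In a trainable version the two cutoffs per filter would be the learned parameters, so a hypothetical bank of 80 such filters would carry 160 parameters instead of the 80 × 251 = 20,080 weights of an unconstrained convolution of the same size.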
Findings
SincNet filters adapt quickly to different non-native speakers.
Model achieves comparable detection performance to state-of-the-art methods.
Significant improvements in phone error rate and diagnosis accuracy.
Abstract
Mispronunciation detection and diagnosis (MDD) is designed to identify pronunciation errors and provide instructive feedback to guide non-native language learners, which is a core component in computer-assisted pronunciation training (CAPT) systems. However, MDD often suffers from the data-sparsity problem because collecting non-native data and the associated annotations is time-consuming and labor-intensive. To address this issue, we explore a fully end-to-end (E2E) neural model for MDD, which processes learners' speech directly based on raw waveforms. Compared to conventional hand-crafted acoustic features, raw waveforms retain more acoustic phenomena and potentially can help neural networks discover better and more customized representations. To this end, our MDD model adopts a so-called SincNet module to take a raw waveform as input and convert it to a suitable vector representation…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
