End-to-End Mispronunciation Detection and Diagnosis From Raw Waveforms

Bi-Cheng Yan; Berlin Chen

arXiv:2103.03023 · eess.AS · June 2, 2021 · 1 citation

Open Access

TL;DR

This paper introduces an end-to-end neural model for mispronunciation detection that processes raw speech waveforms using SincNet, achieving comparable performance to traditional methods while improving interpretability and adaptability across different non-native speakers.

Contribution

The study presents a novel E2E MDD model utilizing SincNet to directly process raw waveforms, reducing parameters and enhancing interpretability compared to conventional CNN-based features.

Findings

01. SincNet filters adapt quickly to different non-native speakers.

02. Model achieves comparable detection performance to state-of-the-art methods.

03. Significant improvements in phone error rate and diagnosis accuracy.

Abstract

Mispronunciation detection and diagnosis (MDD) is designed to identify pronunciation errors and provide instructive feedback to guide non-native language learners, and it is a core component of computer-assisted pronunciation training (CAPT) systems. However, MDD often suffers from the data-sparsity problem because collecting non-native data and the associated annotations is time-consuming and labor-intensive. To address this issue, we explore a fully end-to-end (E2E) neural model for MDD, which processes learners' speech directly based on raw waveforms. Compared to conventional hand-crafted acoustic features, raw waveforms retain more acoustic phenomena and can potentially help neural networks discover better and more customized representations. To this end, our MDD model adopts a so-called SincNet module to take a raw waveform as input and convert it to a suitable vector representation…
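
To make the SincNet idea mentioned in the abstract more concrete, the following is a minimal NumPy sketch of a single sinc-based band-pass filter applied to a raw waveform. It is an illustration of the general SincNet technique, not the authors' implementation: the function names (`sinc_bandpass`, `rms`), the 251-tap kernel, the 16 kHz sample rate, the Hamming window, and the 500-1500 Hz band are all assumptions chosen for the example. In a SincNet layer the two cutoff frequencies would be trainable; here they are fixed constants.

```python
import numpy as np


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=251, sample_rate=16000):
    """Windowed band-pass FIR kernel defined only by its two cutoff frequencies."""
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # np.sinc(x) = sin(pi*x) / (pi*x), so 2*f*sinc(2*f*n) is an ideal low-pass
    # with cutoff f. The difference of two low-pass filters is a band-pass.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window smooths the truncation of the ideal (infinite) response.
    return kernel * np.hamming(kernel_size)


def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate       # one second of audio
    in_band = np.sin(2 * np.pi * 1000 * t)         # tone inside the pass band
    out_of_band = np.sin(2 * np.pi * 4000 * t)     # tone outside the pass band

    kernel = sinc_bandpass(500.0, 1500.0)          # hypothetical cutoffs
    # mode="valid" keeps only fully overlapped samples, avoiding edge transients.
    print("in-band RMS:    ", rms(np.convolve(in_band, kernel, mode="valid")))
    print("out-of-band RMS:", rms(np.convolve(out_of_band, kernel, mode="valid")))
```

In this sketch the whole 251-tap kernel is determined by two numbers, whereas a freely learned convolutional filter of the same length would have 251 weights. That is the sense in which a sinc-parameterised front end uses fewer parameters, and because each filter corresponds to an explicit frequency band, its cutoffs can be read off directly.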

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research