Exploring Non-Autoregressive End-To-End Neural Modeling For English Mispronunciation Detection And Diagnosis

Hsin-Wei Wang; Bi-Cheng Yan; Hsuan-Sheng Chiu; Yung-Chang Hsu; Berlin Chen

arXiv:2111.00844 · cs.CL · February 23, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a non-autoregressive neural approach for English mispronunciation detection and diagnosis that significantly speeds up inference while maintaining high accuracy, addressing key limitations of existing autoregressive models.

Contribution

The paper proposes a novel non-autoregressive E2E neural model for mispronunciation detection and diagnosis (MD&D), together with a pronunciation modeling network that makes detection more effective, improving both inference speed and performance.

Findings

01. Non-autoregressive model achieves faster inference.

02. Maintains competitive accuracy with autoregressive models.

03. Outperforms traditional DNN-HMM based scoring methods.

Abstract

End-to-end (E2E) neural modeling has emerged as one predominant school of thought for developing computer-assisted pronunciation training (CAPT) systems, showing performance competitive with conventional pronunciation-scoring based methods. However, current E2E neural methods for CAPT face at least two pivotal challenges. On the one hand, most E2E methods operate in an autoregressive manner, using left-to-right beam search to dictate the pronunciations of L2 learners. This, however, leads to very slow inference, which inevitably hinders their practical use. On the other hand, E2E neural methods are normally data-hungry, and an insufficient amount of nonnative training data often reduces their efficacy on mispronunciation detection and diagnosis (MD&D). In response, we put forward a novel MD&D method that leverages non-autoregressive (NAR) E2E neural modeling to…
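
To make the contrast with left-to-right beam search concrete, here is a minimal Python sketch of the general idea, not the paper's model. It assumes CTC-style per-frame label scores (the `frame_scores` dictionaries, the blank symbol `_`, and both function names are hypothetical). Every frame is labeled independently, so the labeling step can in principle run in parallel. The recognized phone sequence is then aligned against the canonical one to detect and diagnose errors.

```python
from difflib import SequenceMatcher

BLANK = "_"  # hypothetical blank symbol, as used in CTC-style decoding


def greedy_nar_decode(frame_scores):
    """Label each frame independently, then collapse repeats and drop blanks.

    No frame's label depends on a previously emitted label, which is what
    distinguishes this from autoregressive left-to-right decoding.
    """
    best = [max(scores, key=scores.get) for scores in frame_scores]
    phones, prev = [], None
    for label in best:
        if label != prev and label != BLANK:
            phones.append(label)
        prev = label
    return phones


def diagnose(canonical, recognized):
    """Align canonical and recognized phones; report every non-matching span."""
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    return [
        (tag, canonical[i1:i2], recognized[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Toy input (made up): a learner says "think" with /s/ in place of /th/.
frame_scores = [
    {"s": 0.8, "th": 0.1, BLANK: 0.1},
    {"s": 0.7, "th": 0.1, BLANK: 0.2},
    {BLANK: 0.9, "ih": 0.1},
    {"ih": 0.9, BLANK: 0.1},
    {"ng": 0.8, BLANK: 0.2},
    {"ng": 0.6, BLANK: 0.4},
    {"k": 0.9, BLANK: 0.1},
]
canonical = ["th", "ih", "ng", "k"]
recognized = greedy_nar_decode(frame_scores)
errors = diagnose(canonical, recognized)
print(recognized)  # ['s', 'ih', 'ng', 'k']
print(errors)      # [('replace', ['th'], ['s'])]
```

In this toy case, detection flags the first phone as mispronounced and diagnosis reports that /s/ was produced in place of /th/. A real system would take the frame scores from a trained acoustic model rather than hand-written dictionaries.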

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
