Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms

Tien-Hong Lo; Yao-Ting Sung; Berlin Chen

arXiv:2110.08731 · cs.SD · October 19, 2021 · 5 cites

TL;DR

This paper enhances end-to-end mispronunciation detection models by introducing input and label augmentation strategies that leverage pretrained acoustic models and transcripts, improving discrimination with limited L2 speech data.

Contribution

The paper proposes two novel augmentation methods for E2E MD models, improving their ability to discriminate phonetic and phonological features in low-resource L2 speech data.

Findings

01. E2E MD models with augmentation outperform existing models.

02. Augmentation strategies improve phonetic discrimination.

03. Models show robustness on the L2-ARCTIC dataset.

Abstract

Recently, end-to-end (E2E) models, which take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in the development of mispronunciation detection (MD) systems. However, because labeled speech data from L2 speakers is too scarce for reliable model estimation, E2E MD models are more prone to overfitting than conventional ones built on DNN-HMM acoustic models. To alleviate this critical issue, we propose in this paper two modeling strategies to enhance the discrimination capability of E2E MD models, which implicitly leverage the phonetic and phonological traits encoded in a pretrained acoustic model and contained within the reference transcripts of the training data, respectively. The first one is input augmentation, which aims to distill…
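To make the phone-level output mentioned in the abstract concrete, the following is a minimal sketch of one common way such output can be turned into mispronunciation decisions: align the recognized phone sequence against the canonical (reference) phone sequence with edit distance and flag every non-matching position. This is a generic illustration, not the method proposed in the paper; the function name `align_phones`, the ARPAbet-style phone labels, and the example sequences are all assumptions made for this sketch.

```python
from typing import List, Tuple


def align_phones(canonical: List[str], recognized: List[str]) -> List[Tuple[str, str, str]]:
    """Levenshtein-align a canonical phone sequence with a recognized one.

    Returns (operation, canonical_phone, recognized_phone) tuples, where
    operation is "correct", "substitution", "deletion", or "insertion".
    A missing phone on either side is written as "-".
    (Hypothetical helper for illustration only.)
    """
    n, m = len(canonical), len(recognized)

    # dist[i][j] = edit distance between canonical[:i] and recognized[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # canonical phone was dropped
                dist[i][j - 1] + 1,         # extra phone was produced
            )

    # Trace back from the bottom-right corner to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                op = "correct" if cost == 0 else "substitution"
                ops.append((op, canonical[i - 1], recognized[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], "-"))
            i -= 1
        else:
            ops.append(("insertion", "-", recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Assumed example: a learner says "D AH" where "DH AH" is expected.
    canonical = ["DH", "AH", "K", "AE", "T"]
    recognized = ["D", "AH", "K", "AE", "T"]
    for op, ref, hyp in align_phones(canonical, recognized):
        flag = "" if op == "correct" else "  <-- flagged"
        print(f"{ref:>3} {hyp:>3}  {op}{flag}")
```

In a setup like this, the quality of the flags depends entirely on how well the recognizer discriminates between phones, which is the capability the abstract says the two augmentation strategies are meant to improve.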

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques