Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Mu Yang; Kevin Hirschi; Stephen D. Looney; Okim Kang; John H. L. Hansen

arXiv:2203.15937·eess.AS·July 13, 2022·1 cites

TL;DR

This paper improves mispronunciation detection and diagnosis (MDD) by fine-tuning Wav2vec 2.0 with dynamic pseudo labels on unlabeled L2 speech, lowering phoneme error rate and correlating well with human accentedness and intelligibility assessments.

Contribution

It introduces an online ensemble pseudo-labeling approach for fine-tuning SSL models, improving phoneme recognition and MDD F1 score over a baseline fine-tuned on labeled samples only.

Findings

01. 5.35% phoneme error rate reduction

02. 2.48% MDD F1 score improvement

03. Strong correlation with human perception
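
For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how phoneme error rate (PER) and an F1 score are commonly computed. It is not taken from the paper: the function names, the phoneme symbols, and the way detections are counted as true or false positives are assumptions made for illustration, and the paper's exact MDD scoring protocol may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit errors divided by total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts.

    How tp/fp/fn are counted for mispronunciation detection is an
    assumption here; the source page does not spell it out.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one substituted phoneme out of two.
per = phoneme_error_rate([["DH", "AH"]], [["D", "AH"]])
```

In this hypothetical example, one substitution against a two-phoneme reference gives a PER of 0.5.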

Abstract

Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced by an ensemble of the online model on-the-fly, which ensures that our model is robust to pseudo label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and 2.48% MDD F1 score improvement over a labeled-samples-only…
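
The dynamic pseudo-labeling idea in the abstract can be illustrated with a small sketch. It assumes, as is common in momentum pseudo-labeling, that the "ensemble of the online model" is an exponential moving average (EMA) of the online model's weights, and that pseudo labels are read off by greedy CTC-style decoding. Neither detail is confirmed by the text above, and the names `ema_update`, `greedy_ctc_decode`, and the `momentum` value are hypothetical.

```python
def ema_update(teacher, online, momentum=0.999):
    """Move each teacher weight toward the matching online weight.

    Assumption: the ensemble is an exponential moving average of the
    online model's parameters, stored here as plain name -> float dicts.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * online[name]
        for name in teacher
    }


def greedy_ctc_decode(frame_scores, blank=0):
    """Turn per-frame scores into a label sequence.

    Takes the argmax per frame, collapses consecutive repeats, and drops
    the blank symbol. Using CTC decoding is an assumption for this sketch.
    """
    labels = []
    prev = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != prev and best != blank:
            labels.append(best)
        prev = best
    return labels


# Hypothetical single training step on an unlabeled utterance:
teacher = {"w": 1.0}
online = {"w": 0.0}  # weights after a gradient step on labeled + pseudo-labeled data
teacher = ema_update(teacher, online, momentum=0.9)

# Scores the teacher might emit for 5 frames over {blank, phoneme 1, phoneme 2}.
frames = [[0.1, 0.9, 0.0], [0.1, 0.9, 0.0], [0.9, 0.1, 0.0],
          [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
pseudo_label = greedy_ctc_decode(frames)
```

Because the teacher changes slowly relative to the online model, its pseudo labels are refreshed throughout training without tracking every noisy update, which is one plausible reading of why the abstract describes the approach as robust to pseudo label noise.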

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems