Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Linkai Peng; Yingming Gao; Binghuai Lin; Dengfeng Ke; Yanlu Xie; Jinsong Zhang

arXiv:2206.07289 · cs.SD · June 16, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces a gating strategy and a contrastive loss that improve end-to-end mispronunciation detection and diagnosis by integrating text and audio features more effectively, raising the F1 score from 57.51% to 61.75% on TIMIT and L2-Arctic.

Contribution

The paper proposes a gating mechanism and a contrastive loss for more effective text-audio fusion in mispronunciation detection models.
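To make the contrastive-loss idea concrete, here is a minimal sketch in plain Python. It uses the classic margin-based pairwise formulation, which is an assumption for illustration and not necessarily the loss defined in the paper. The function names, the `margin` parameter, and the pairing of one audio vector with one text vector per phone are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def contrastive_loss(audio_vecs, text_vecs, is_match, margin=1.0):
    """Classic margin-based contrastive loss over (audio, text) pairs.

    is_match[i] is True when the audio realizes the canonical phone
    (pull the pair together) and False when it is mispronounced
    (push the pair at least `margin` apart). Illustrative only.
    """
    total = 0.0
    for a, t, y in zip(audio_vecs, text_vecs, is_match):
        d = euclidean(a, t)
        if y:
            total += d ** 2                      # matched pair: penalize distance
        else:
            total += max(0.0, margin - d) ** 2   # mismatched pair: penalize closeness
    return total / len(is_match)

# Toy pairs: one matched, one mismatched but too close to the text vector.
loss = contrastive_loss([[0.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.5, 0.0]],
                        [True, False])
print(loss)
```

In a real model the vectors would be learned embeddings and the loss would be added to the phoneme recognition objective with some weighting, which this sketch leaves out.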

Findings

01. F1 score improved from 57.51% to 61.75% on TIMIT and L2-Arctic datasets.

02. Gating mechanism enhances relevance of audio features during training.

03. Contrastive loss reduces discrepancy between phoneme recognition and MDD objectives.
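For readers unfamiliar with how an F1 score is obtained in mispronunciation detection, the sketch below follows the convention commonly used in MDD work, which is assumed here rather than taken from this page: a true rejection is a mispronounced phone correctly flagged, a false rejection is a correct phone wrongly flagged, and a false acceptance is a mispronounced phone that was missed. The function name and the example counts are hypothetical and are not the paper's data.

```python
def mdd_f1(true_rejections, false_rejections, false_acceptances):
    """Return (precision, recall, f1) for mispronunciation detection.

    precision = TR / (TR + FR): share of flagged phones that were truly wrong.
    recall    = TR / (TR + FA): share of truly wrong phones that were flagged.
    """
    flagged = true_rejections + false_rejections
    actual = true_rejections + false_acceptances
    precision = true_rejections / flagged if flagged else 0.0
    recall = true_rejections / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for illustration only.
precision, recall, f1 = mdd_f1(true_rejections=60,
                               false_rejections=40,
                               false_acceptances=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a system cannot raise it by flagging everything or by flagging nothing.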

Abstract

Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. In the field of assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have fully utilized the prior texts for model construction or for improving system performance, e.g., forced alignment and extended recognition networks. Recently, some end-to-end methods have attempted to incorporate the prior texts into model training and have shown preliminary effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while…
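The attention-plus-gate fusion the abstract describes can be sketched roughly as follows. This is an illustrative assumption, not the paper's architecture: each audio frame attends over the text (phoneme) embeddings with dot-product attention, and a scalar sigmoid gate decides how much of the resulting text context is mixed back into the frame. The function `gated_fusion`, the additive mixing rule, and the parameters `gate_w` and `gate_b` are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(audio, text, gate_w, gate_b=0.0):
    """Fuse audio frames with attention-weighted text context via a gate.

    audio: list of T frame vectors; text: list of N phoneme vectors,
    all of dimension d. gate_w: hypothetical gate weights, length 2 * d.
    Returns the fused frames and the per-frame gate values.
    """
    fused, gates = [], []
    for frame in audio:
        # Dot-product attention of this frame over the text embeddings.
        weights = softmax([dot(frame, tok) for tok in text])
        context = [sum(w * tok[i] for w, tok in zip(weights, text))
                   for i in range(len(frame))]
        # Scalar gate in (0, 1) computed from the frame and its context.
        g = sigmoid(dot(gate_w, frame + context) + gate_b)
        # The audio frame is always kept; the gate scales the text context.
        fused.append([a + g * c for a, c in zip(frame, context)])
        gates.append(g)
    return fused, gates

# Toy example: 2 audio frames, 2 phoneme embeddings, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused, gates = gated_fusion(audio, text, gate_w=[0.0, 0.0, 0.0, 0.0])
print(gates)
```

A gate near zero leaves the frame close to its audio-only value, which is one way a model could down-weight text context when the spoken phone does not match the prompt.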

Code & Models

Repositories

vocaliodmiku/wav2vec2mdd-text (PyTorch · Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Contrastive Learning