Text-Aware End-to-end Mispronunciation Detection and Diagnosis
Linkai Peng, Yingming Gao, Binghuai Lin, Dengfeng Ke, Yanlu Xie, Jinsong Zhang

TL;DR
This paper introduces a gating strategy and a contrastive loss that improve end-to-end mispronunciation detection by better integrating text and audio features, raising the F1 score from 57.51% to 61.75%.
Contribution
It proposes a novel gating mechanism and contrastive loss for more effective text-audio fusion in mispronunciation detection models.
Findings
F1 score improved from 57.51% to 61.75% on TIMIT and L2-Arctic datasets.
Gating mechanism enhances relevance of audio features during training.
Contrastive loss reduces discrepancy between phoneme recognition and MDD objectives.
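As a reminder of how the reported metric is defined, here is a minimal sketch of the standard F1 computation. The counts are hypothetical and are not taken from the paper; treating a "positive" as a phone flagged as mispronounced is also an assumption about the evaluation convention.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper):
# 60 mispronunciations correctly flagged, 40 correct phones wrongly flagged,
# 30 mispronunciations missed.
example_f1 = f1_score(true_pos=60, false_pos=40, false_neg=30)
print(f"F1 = {example_f1:.4f}")
```

Because F1 balances precision against recall, a gain such as the one reported cannot come from simply flagging more phones as mispronounced.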
Abstract
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. When assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have fully utilized the prior texts for model construction or for improving system performance, e.g. forced alignment and extended recognition networks. Recently, some end-to-end methods have attempted to incorporate the prior texts into model training and have shown preliminary effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while…
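To make the idea of gating between audio and text features concrete, the following is a minimal sketch of one common gated-fusion form: a sigmoid gate that mixes the two streams element-wise. It is not the paper's architecture, which this summary does not specify. The function name `gated_fusion`, the weights `W_a`, `W_t`, `b`, and the assumption that the text features are already aligned to audio frames (for example, by attention) are all illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(audio, text_ctx, W_a, W_t, b):
    """Mix audio and frame-aligned text features with an element-wise gate.

    audio, text_ctx: arrays of shape (T, d).
    Returns the fused features and the gate, both of shape (T, d).
    A gate value near 1 keeps the audio feature; near 0 keeps the text feature.
    """
    gate = sigmoid(audio @ W_a + text_ctx @ W_t + b)
    fused = gate * audio + (1.0 - gate) * text_ctx
    return fused, gate


# Random stand-ins for real encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
T, d = 5, 4
audio = rng.standard_normal((T, d))
text_ctx = rng.standard_normal((T, d))
W_a = 0.1 * rng.standard_normal((d, d))
W_t = 0.1 * rng.standard_normal((d, d))
b = np.zeros(d)

fused, gate = gated_fusion(audio, text_ctx, W_a, W_t, b)
print(fused.shape, float(gate.min()), float(gate.max()))
```

In a sketch like this, a learned gate could lean on the audio stream where the canonical text and the actual pronunciation disagree, which is the mismatch case the abstract says raw attention ignores.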
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Contrastive Learning
