CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
Nianzu Zheng, Liqun Deng, Wenyong Huang, Yu Ting Yeung, Baohua Xu, Yuanyuan Guo, Yasheng Wang, Xiao Chen, Xin Jiang, Qun Liu

TL;DR
This paper introduces CoCA-MDD, a streaming end-to-end mispronunciation detection and diagnosis model that reduces latency and improves accuracy by using a conv-transformer and coupled cross-attention mechanisms.
Contribution
The paper presents a novel streaming MDD model with a coupled cross-attention mechanism for integrating acoustic and linguistic features, enabling real-time detection and classification.
Findings
Achieves 57.03% F1 score in streaming mode on L2-ARCTIC
Attains 0.58 PCC for pronunciation scoring on SpeechOcean762
Demonstrates improved performance with system fusion in streaming MDD
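For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an F1 score and a Pearson correlation coefficient (PCC) are generally computed. The function names and the toy counts are hypothetical, and treating a detected mispronunciation as the "positive" class is an assumption; the summary does not state the paper's exact counting convention.

```python
import math


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw detection counts.

    Assumption: a "positive" is a phone flagged as mispronounced.
    Returns 0.0 when precision or recall is undefined.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values, not taken from the paper.
toy_f1 = f1_score(tp=6, fp=2, fn=4)
toy_pcc = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

An F1 of 57.03% therefore summarizes the balance between precision and recall on mispronounced phones, while a PCC of 0.58 measures how closely predicted pronunciation scores track human scores linearly.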
Abstract
Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires the entire speech utterance as input context, which leads to significant latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We use a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and the mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on…
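The idea of attending between frame-level acoustic features and encoded reference phones can be illustrated with plain scaled dot-product cross-attention. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how the two attentions are coupled, so the two directions shown here (frames querying phones, and phones querying frames) are one plausible reading. All names, dimensions, and values are hypothetical, and learned projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy inputs (hypothetical): 3 acoustic frames and 2 reference phones, dim 2.
acoustic_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phone_embeddings = [[1.0, 0.0], [0.0, 1.0]]

# Assumed direction 1: each frame queries the reference phones. The reference
# text is known up front, so this can be computed as each frame arrives.
frame_context = cross_attention(acoustic_frames, phone_embeddings, phone_embeddings)

# Assumed direction 2: each phone queries the acoustic frames. This needs the
# frames of the whole utterance, matching utterance-level classification.
phone_context = cross_attention(phone_embeddings, acoustic_frames, acoustic_frames)
```

In this reading, `frame_context` has one row per acoustic frame and could feed a streaming output, while `phone_context` has one row per reference phone and could feed a per-phone classifier once the utterance ends.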
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
