CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

Nianzu Zheng; Liqun Deng; Wenyong Huang; Yu Ting Yeung; Baohua Xu; Yuanyuan Guo; Yasheng Wang; Xiao Chen; Xin Jiang; Qun Liu

arXiv:2111.08191·cs.CL·June 30, 2022

Open Access

TL;DR

This paper introduces CoCA-MDD, a streaming end-to-end mispronunciation detection and diagnosis (MDD) model that uses a conv-transformer encoder and a coupled cross-attention (CoCA) mechanism to avoid the latency of models that need the whole utterance as input.

Contribution

The paper presents a streaming MDD model with a coupled cross-attention mechanism that integrates frame-level acoustic features with encoded reference linguistic features, supporting both streaming detection and utterance-level mispronunciation classification.

Findings

01. Achieves a 57.03% F1 score in streaming mode on L2-ARCTIC

02. Attains a Pearson correlation coefficient (PCC) of 0.58 for pronunciation scoring on SpeechOcean762

03. Demonstrates improved performance with system fusion in streaming MDD
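
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an F1 score and a Pearson correlation coefficient are computed in general. It is not the paper's evaluation code: the function names and the toy counts and scores are hypothetical, and it assumes mispronounced phones are treated as the positive class for F1.

```python
import math


def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up detection counts: 6 correctly flagged mispronunciations,
# 2 false alarms, 4 missed mispronunciations.
toy_f1 = f1_score(true_pos=6, false_pos=2, false_neg=4)

# Made-up human scores versus model scores for three utterances.
toy_pcc = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With these toy inputs, precision is 0.75 and recall is 0.6, and the two score lists are perfectly linearly related.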

Abstract

Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We use a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and the mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on…
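
The idea of integrating frame-level acoustic features with reference linguistic features can be illustrated with plain cross-attention. The sketch below is a generic single-head scaled dot-product cross-attention, not the paper's coupled mechanism: it omits learned projections, and the function names and toy vectors are hypothetical. Each acoustic frame acts as a query over the reference phoneme vectors, which are known before the audio arrives, so in this sketch no future audio frames are needed to process a given frame.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(acoustic, linguistic):
    """Generic single-head scaled dot-product cross-attention.

    acoustic:   T frames, each a list of d floats (queries)
    linguistic: N reference vectors, each a list of d floats (keys and values)
    Returns (contexts, weights): T context vectors and T x N attention weights.
    """
    d = len(acoustic[0])
    contexts, weights = [], []
    for query in acoustic:
        # Similarity of this frame to every reference vector.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in linguistic
        ]
        w = softmax(scores)
        # Weighted sum of reference vectors for this frame.
        ctx = [
            sum(w[n] * linguistic[n][j] for n in range(len(linguistic)))
            for j in range(d)
        ]
        weights.append(w)
        contexts.append(ctx)
    return contexts, weights


# Made-up toy inputs: 3 acoustic frames, 2 reference phoneme vectors, d = 2.
acoustic = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
linguistic = [[1.0, 0.0], [0.0, 1.0]]
contexts, weights = cross_attention(acoustic, linguistic)
```

In this toy case the first frame puts more weight on the first reference vector, the second frame on the second, and the third frame splits its weight evenly.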

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing