Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information

Kaiqi Fu; Shaojun Gao; Kai Wang; Wei Li; Xiaohai Tian; Zejun Ma

arXiv:2203.01826 · eess.AS · March 4, 2022 · 5 cites

Open Access

TL;DR

This paper introduces a phone-level mixup data augmentation technique combined with multi-source features to enhance non-native pronunciation scoring, reducing data requirements and improving correlation with human scores.

Contribution

It proposes a novel phone-level mixup method and multi-source feature integration to improve pronunciation scoring accuracy with less labeled data.

Findings

01 Mixup improves Pearson correlation from 0.567 to 0.61.

02 Achieves similar performance with only 10% of labeled data.

03 Multi-source features further enhance scoring accuracy.
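
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how a Pearson correlation coefficient between model scores and human scores can be computed. The function name `pearson` and the toy score lists are assumptions made for illustration; they are not data or code from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations.
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# Hypothetical human and model word scores, invented for this sketch.
human_scores = [2, 5, 7, 9, 10]
model_scores = [3.1, 4.2, 6.8, 8.5, 9.9]
r = pearson(human_scores, model_scores)
```

A value near 1 means the model ranks and spaces words much as the human raters do, which is why a rise from 0.567 to 0.61 is reported as an improvement.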

Abstract

Deep learning-based pronunciation scoring models rely heavily on the availability of annotated non-native data, which is costly to collect and difficult to scale. To deal with the data scarcity problem, data augmentation is commonly used for model pretraining. In this paper, we propose phone-level mixup, a simple yet effective data augmentation method, to improve the performance of word-level pronunciation scoring. Specifically, given a phoneme sequence from the lexicon, an artificial augmented word sample can be generated by randomly sampling from the corresponding phone-level features in the training data, while the word score is the average of their GOP scores. Benefiting from the arbitrary phone-level combination, the mixup is able to generate any word with various pronunciation scores. Moreover, we utilize multi-source information (e.g., MFCC and deep features) to further improve the scoring…
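
The sampling step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the name `phone_level_mixup`, the `phone_pool` dictionary layout, and the toy feature vectors and GOP values are all hypothetical, and the abstract does not specify how features are represented or how the generated samples are fed into pretraining.

```python
import random
from statistics import mean


def phone_level_mixup(phonemes, phone_pool, rng=random):
    """Build one synthetic word sample from a lexicon phoneme sequence.

    phone_pool (hypothetical layout) maps each phoneme to a list of
    (feature, gop_score) pairs collected from the training data.
    """
    # For each phoneme in the word, draw one training occurrence at random.
    sampled = [rng.choice(phone_pool[p]) for p in phonemes]
    features = [feature for feature, _ in sampled]
    # The word-level label is the average of the sampled phones' GOP scores.
    word_score = mean(gop for _, gop in sampled)
    return features, word_score


# Toy pool with invented feature vectors and GOP values.
phone_pool = {
    "K": [([0.1, 0.2], -0.5), ([0.3, 0.1], -2.0)],
    "AE": [([0.7, 0.4], -0.1)],
    "T": [([0.2, 0.9], -1.2), ([0.5, 0.5], -0.3)],
}

# Generate one augmented sample for the phoneme sequence K AE T.
features, score = phone_level_mixup(["K", "AE", "T"], phone_pool, random.Random(0))
```

Because each phone is drawn independently, repeated calls for the same phoneme sequence yield different feature combinations and different averaged scores, which matches the abstract's point that the mixup can produce any word with various pronunciation scores.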

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Mixup