Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

Wei Liu; Kaiqi Fu; Xiaohai Tian; Shuju Shi; Wei Li; Zejun Ma; Tan Lee

arXiv:2302.10444 · eess.AS · March 14, 2023

TL;DR

This paper introduces a novel pronunciation scoring method that explicitly uses linguistic-acoustic similarity at the phone level, combined with a transformer-based model, to improve non-native speech assessment accuracy.

Contribution

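The study proposes an explicit similarity-based approach to pronunciation scoring: it measures the cosine similarity between reference phone embeddings and acoustic embeddings, adds a phone-level GOP pre-training stage to initialize both embeddings, and uses a transformer-based hierarchical scorer, rather than implicitly adding or concatenating the embeddings.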

Findings

1. Significantly outperforms baseline methods on non-native speech datasets.

2. Phone embeddings effectively capture native pronunciation attributes.

3. Similarity-based features improve utterance-level scoring accuracy.

Abstract

Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., addition or concatenation of reference phone embedding and actual pronunciation of the target phone as the phone-level pronunciation quality representation. In this paper, we propose to use linguistic-acoustic similarity to explicitly measure the deviation of non-native production from its native reference for pronunciation assessment. Specifically, the deviation is first estimated by the cosine similarity between reference phone embedding and corresponding acoustic embedding. Next, a phone-level Goodness of pronunciation (GOP) pre-training stage is introduced to guide this similarity-based learning for better initialization of the aforementioned two embeddings. Finally, a transformer-based hierarchical pronunciation scorer…
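The similarity step described in the abstract can be illustrated with a minimal sketch. It assumes each target phone already has a reference phone embedding and an acoustic embedding of the same dimension; how the paper's models produce those embeddings is not covered here. The function names and the toy vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm, eps)


def phone_similarities(reference_embs, acoustic_embs):
    """One similarity value per phone: reference embedding vs. acoustic embedding."""
    if len(reference_embs) != len(acoustic_embs):
        raise ValueError("expected one acoustic embedding per reference phone")
    return [cosine_similarity(r, a) for r, a in zip(reference_embs, acoustic_embs)]


# Toy data (made up): 3 phones with 4-dimensional embeddings.
reference = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
produced = [
    [2.0, 0.0, 0.0, 0.0],    # same direction as the reference
    [1.0, 0.0, 0.0, 0.0],    # orthogonal to the reference
    [-1.0, -1.0, 0.0, 0.0],  # opposite direction
]

similarities = phone_similarities(reference, produced)
print(similarities)
```

In this sketch a value near 1 means the produced phone lies close to its reference in embedding space, and lower values indicate larger deviation. In the paper, such phone-level similarities feed a learned utterance-level scorer rather than being used directly as scores.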

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing