Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring
Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li, Zejun Ma, Tan Lee

TL;DR
This paper introduces a novel pronunciation scoring method that explicitly uses linguistic-acoustic similarity at the phone level, combined with a transformer-based model, to improve non-native speech assessment accuracy.
Contribution
The study proposes an explicit similarity-based approach to pronunciation scoring. It combines a phone-level GOP pre-training stage with a transformer-based hierarchical scorer, going beyond methods that only add or concatenate reference phone embeddings and acoustic embeddings.
Findings
The proposed method significantly outperforms baseline methods on non-native speech datasets.
Phone embeddings effectively capture native pronunciation attributes.
Similarity-based features improve utterance-level scoring accuracy.
Abstract
Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., addition or concatenation of reference phone embedding and actual pronunciation of the target phone as the phone-level pronunciation quality representation. In this paper, we propose to use linguistic-acoustic similarity to explicitly measure the deviation of non-native production from its native reference for pronunciation assessment. Specifically, the deviation is first estimated by the cosine similarity between reference phone embedding and corresponding acoustic embedding. Next, a phone-level Goodness of pronunciation (GOP) pre-training stage is introduced to guide this similarity-based learning for better initialization of the aforementioned two embeddings. Finally, a transformer-based hierarchical pronunciation scorer…
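The similarity step described in the abstract can be illustrated with a minimal sketch. It assumes each phone already has a reference phone embedding and an acoustic embedding of the same dimension; the function name `cosine_similarity`, the `eps` guard, and the toy vectors are hypothetical and are not taken from the paper, which learns both embeddings rather than fixing them by hand.

```python
import math


def cosine_similarity(ref, acoustic, eps=1e-8):
    """Cosine similarity between two equal-length vectors.

    `ref` stands in for a reference phone embedding and `acoustic`
    for the acoustic embedding of the produced phone (both assumed).
    `eps` guards against division by zero for all-zero vectors.
    """
    dot = sum(r * a for r, a in zip(ref, acoustic))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(a * a for a in acoustic))
    return dot / max(norm, eps)


# Hypothetical toy embeddings for three phones (4 dimensions each).
ref_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
acoustic_embeddings = [
    [2.0, 0.0, 0.0, 0.0],    # same direction as the reference
    [1.0, 0.0, 0.0, 0.0],    # orthogonal to the reference
    [-1.0, -1.0, 0.0, 0.0],  # opposite direction to the reference
]

# One similarity value per phone; higher means closer to the reference.
similarities = [
    cosine_similarity(r, a) for r, a in zip(ref_embeddings, acoustic_embeddings)
]
```

In this toy case the three values are approximately 1, 0, and -1. Under the paper's framing, such per-phone similarities would act as phone-level features for the utterance-level scorer; how they are combined there is not shown in this sketch.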
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
