Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

Eesung Kim; Jae-Jin Jeon; Hyeji Seo; Hoon Kim

arXiv:2204.03863·eess.AS·April 11, 2022·1 cites

TL;DR

This paper introduces a novel SSL-based approach for automatic pronunciation assessment that fine-tunes pre-trained models and extracts layer-wise representations to accurately evaluate ESL learners' pronunciation.

Contribution

It presents a new SSL model fine-tuning method combined with layer-wise representation extraction for improved pronunciation scoring in ESL contexts.

Findings

01. Outperforms baseline methods in Pearson correlation coefficient

02. Effective on datasets of Korean ESL children and Speechocean762

03. Analyzes impact of different transformer layer representations
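
The Pearson correlation coefficient named in the findings measures how closely predicted scores track human ratings. A minimal sketch of that metric follows; the function name and the sample scores are hypothetical and are not taken from the paper.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson correlation between two equal-length score lists."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two score pairs are required")

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n

    # Unnormalized covariance and variances (the 1/n factors cancel).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")

    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores for illustration only (not results from the paper).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.2, 1.9, 3.4, 3.8, 5.1]
pcc = pearson_correlation(model_scores, human_scores)
print(f"PCC = {pcc:.3f}")
```

A value near 1 means the model ranks and spaces utterances much as the human raters do; the metric is insensitive to a constant offset or scale in the predicted scores.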

Abstract

Self-supervised learning (SSL) models such as wav2vec 2.0 and HuBERT have shown promising results in various downstream tasks in the speech community. In particular, speech representations learned by SSL models have been shown to be effective for encoding various speech-related characteristics. In this context, we propose a novel automatic pronunciation assessment method based on SSL models. First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification (CTC) to adapt them to the English pronunciation of English-as-a-second-language (ESL) learners in a given data environment. Then, layer-wise contextual representations are extracted from all of the transformer layers of the SSL models. Finally, the automatic pronunciation score is estimated using bidirectional long short-term memory with the layer-wise contextual representations and the…
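
Connectionist temporal classification, used in the fine-tuning step, relates a long sequence of frame-level outputs to a shorter label sequence by merging consecutive repeats and then dropping a special blank symbol. The sketch below illustrates only that collapsing rule; it is not the authors' training code, and the blank symbol `_`, the function name, and the character-level labels are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems reserve a vocabulary index


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule to a sequence of frame-level labels."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only served to separate true repeats.
    return [label for label in merged if label != blank]


# Illustrative frame-level output: the blank between the two "l" runs
# is what lets the double letter survive the merge step.
frames = list("hh_ell_lo")
print("".join(ctc_collapse(frames)))
```

Because many frame sequences collapse to the same label sequence, CTC training can sum over all of them and therefore needs no frame-level alignment between audio and transcript.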

Taxonomy

Topics: Speech Recognition and Synthesis