TL;DR
This paper introduces a Transformer-based model that assesses multiple aspects of non-native English pronunciation at various granularities, improving accuracy over previous single-aspect, single-granularity methods.
Contribution
It proposes a multi-task learning approach with a Goodness Of Pronunciation feature-based Transformer (GOPT) for comprehensive pronunciation assessment.
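As a rough illustration of the multi-task idea, the sketch below sums one regression loss per (granularity, aspect) task. The task names, the use of mean squared error, and the equal default weights are assumptions made for illustration; the summary above does not specify the paper's exact loss.

```python
from typing import Dict, List, Optional


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length score lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(
    preds: Dict[str, List[float]],
    targets: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-task MSE losses.

    Each key is a hypothetical task name such as "phoneme_accuracy" or
    "utterance_fluency"; tasks missing from `weights` default to 1.0.
    """
    weights = weights or {}
    return sum(
        weights.get(task, 1.0) * mse(preds[task], targets[task])
        for task in targets
    )


# Hypothetical scores for one utterance (values are made up).
preds = {
    "phoneme_accuracy": [1.0, 2.0],
    "word_accuracy": [1.5],
    "utterance_fluency": [0.5],
}
targets = {
    "phoneme_accuracy": [1.0, 1.0],
    "word_accuracy": [1.5],
    "utterance_fluency": [1.0],
}
loss = multitask_loss(preds, targets)
```

Sharing one model across all tasks means a single backward pass on this combined loss updates the shared representation for every aspect and granularity at once.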
Findings
GOPT achieves state-of-the-art results on speechocean762
Multi-aspect, multi-granularity modeling improves assessment accuracy
These results are obtained with a public ASR acoustic model trained on Librispeech
Abstract
Automatic pronunciation assessment is an important technology to help self-directed language learners. While pronunciation quality has multiple aspects including accuracy, fluency, completeness, and prosody, previous efforts typically only model one aspect (e.g., accuracy) at one granularity (e.g., at the phoneme-level). In this work, we explore modeling multi-aspect pronunciation assessment at multiple granularities. Specifically, we train a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning. Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.
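For readers unfamiliar with Goodness Of Pronunciation (GOP) features, the sketch below shows one common posterior-based formulation: the mean log posterior that an acoustic model assigns to the canonical phone over the frames aligned to it. This is an assumption for illustration only; the abstract does not state which GOP variant GOPT uses, and the function name and input layout here are hypothetical.

```python
import math
from typing import Sequence


def gop_log_posterior(
    frame_posteriors: Sequence[Sequence[float]], phone_index: int
) -> float:
    """Mean log posterior of the canonical phone over its aligned frames.

    frame_posteriors: one probability vector over phones per frame
                      (hypothetical layout; would come from the ASR model).
    phone_index:      index of the phone the learner was expected to say.
    """
    if not frame_posteriors:
        raise ValueError("need at least one aligned frame")
    eps = 1e-10  # floor to avoid log(0)
    total = sum(
        math.log(max(frame[phone_index], eps)) for frame in frame_posteriors
    )
    return total / len(frame_posteriors)


# Two frames, two phones; made-up posteriors.
frames = [[0.5, 0.5], [0.25, 0.75]]
gop = gop_log_posterior(frames, phone_index=0)
```

Scores closer to zero indicate that the acoustic model found the expected phone more likely, which is why such features are a natural input for a pronunciation scorer.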
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Dropout · Softmax
