HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages

Bi-Cheng Yan; Hsin-Wei Wang; Fu-An Chao; Tien-Hong Lo; Yung-Chang Hsu; Berlin Chen

arXiv:2512.04964 · eess.AS · December 5, 2025 · ACL

Open Access

TL;DR

This paper introduces HiPPO, a hierarchical model for automatic pronunciation assessment that scores unscripted speech at multiple linguistic levels using only the learner's speech. It is trained with a contrastive ordinal regularizer and a curriculum learning strategy to improve assessment accuracy.

Contribution

The paper presents HiPPO, a new hierarchical pronunciation assessment model designed for unscripted speech from L2 learners and trained with a contrastive ordinal regularizer and a curriculum learning strategy.

Findings

01

HiPPO outperforms existing methods on the Speechocean762 dataset.

02

The contrastive ordinal regularizer enhances score discrimination.

03

Curriculum learning improves assessment accuracy in unscripted speech.

Abstract

Automatic pronunciation assessment (APA) seeks to quantify a second language (L2) learner's pronunciation proficiency in a target language by offering timely and fine-grained diagnostic feedback. Most existing efforts on APA have predominantly concentrated on highly constrained reading-aloud tasks (where learners are prompted to read a reference text aloud); however, assessing pronunciation quality in unscripted speech (or free-speaking scenarios) remains relatively underexplored. In light of this, we first propose HiPPO, a hierarchical pronunciation assessment model tailored for spoken languages, which evaluates an L2 learner's oral proficiency at multiple linguistic levels based solely on the speech uttered by the learner. To improve the overall accuracy of assessment, a contrastive ordinal regularizer and a curriculum learning strategy are introduced for model training. The former…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Stuttering Research and Treatment