Pronunciation Assessment with Multi-modal Large Language Models
Kaiqi Fu, Linkai Peng, Nan Yang, Shuran Zhou

TL;DR
This paper introduces a multi-modal large language model (LLM) based system for automated pronunciation assessment that combines speech and text features to score the accuracy and fluency of learner speech.
Contribution
It presents a scoring framework that feeds speech-encoder features, together with an embedded task prefix and prompt text, into an LLM. The framework achieves competitive results on the Speechocean762 dataset, and the paper analyzes the impact of prompt text and training strategies.
Findings
Achieves competitive scoring accuracy on the Speechocean762 dataset
Demonstrates the effectiveness of prompt text and training strategies
Provides insights through ablation studies on model components
Abstract
Large language models (LLMs), renowned for their powerful conversational abilities, are widely recognized as exceptional tools in the field of education, particularly in the context of automated intelligent instruction systems for language learning. In this paper, we propose a scoring system based on LLMs, motivated by their positive impact on text-related scoring tasks. Specifically, the speech encoder first maps the learner's speech into contextual features. The adapter layer then transforms these features to align with the text embedding in latent space. The assessment task-specific prefix and prompt text are embedded and concatenated with the features generated by the modality adapter layer, enabling the LLMs to predict accuracy and fluency scores. Our experiments demonstrate that the proposed scoring systems achieve competitive results compared to the baselines on the…
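The data flow described in the abstract (speech encoder, then adapter, then concatenation with the embedded prefix and prompt) can be illustrated with a small shape-level sketch. This is not the authors' implementation: the dimensions, the frame-stacking adapter design, the prefix-speech-prompt ordering, and all names such as `modality_adapter` and `build_llm_input` are assumptions made for illustration, and random values stand in for real encoder outputs and token embeddings.

```python
import random

rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
D_SPEECH, D_LLM, STRIDE = 8, 4, 2


def random_matrix(rows, cols):
    """Stand-in for real features or embeddings: rows x cols random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(frames, weight):
    """Project each frame (length d_in) with a d_in x d_out weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(frame, col)) for col in columns]
            for frame in frames]


def modality_adapter(speech_features, weight, stride=STRIDE):
    """Assumed adapter: stack `stride` adjacent frames to shorten the
    sequence, then project to the LLM embedding width."""
    usable = len(speech_features) - len(speech_features) % stride
    stacked = [
        [value for frame in speech_features[i:i + stride] for value in frame]
        for i in range(0, usable, stride)
    ]
    return linear(stacked, weight)


def build_llm_input(prefix_emb, speech_emb, prompt_emb):
    """Concatenate along the sequence axis (ordering is an assumption)."""
    return prefix_emb + speech_emb + prompt_emb


speech_features = random_matrix(10, D_SPEECH)        # 10 encoder frames
adapter_weight = random_matrix(STRIDE * D_SPEECH, D_LLM)
prefix_emb = random_matrix(3, D_LLM)                 # 3 task-prefix tokens
prompt_emb = random_matrix(6, D_LLM)                 # 6 prompt-text tokens

speech_emb = modality_adapter(speech_features, adapter_weight)
llm_input = build_llm_input(prefix_emb, speech_emb, prompt_emb)
print(len(llm_input), len(llm_input[0]))             # prints: 14 4
```

In this sketch the 10 speech frames are reduced to 5 adapter outputs, so the LLM would receive a sequence of 3 + 5 + 6 = 14 vectors, each as wide as its text embeddings. The LLM would then generate the accuracy and fluency scores from that sequence; that step is omitted here.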
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Adapter · ALIGN
