Pronunciation Assessment with Multi-modal Large Language Models

Kaiqi Fu; Linkai Peng; Nan Yang; Shuran Zhou

arXiv:2407.09209·cs.CL·July 19, 2024·1 cites

Open Access

TL;DR

This paper introduces a multi-modal large language model-based system for automated pronunciation assessment that combines speech and text features to score the accuracy and fluency of learner speech.

Contribution

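The paper presents a scoring framework that feeds speech-encoder features, together with a task-specific prefix and prompt text, into an LLM. The framework achieves competitive results against baselines on the Speechocean762 dataset, and the paper analyzes the impact of prompt text and training strategies.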

Findings

01 Achieves competitive scoring accuracy on the Speechocean762 dataset

02 Demonstrates the effectiveness of prompt text and training strategies

03 Provides insights through ablation studies on model components

Abstract

Large language models (LLMs), renowned for their powerful conversational abilities, are widely recognized as exceptional tools in the field of education, particularly in the context of automated intelligent instruction systems for language learning. In this paper, we propose a scoring system based on LLMs, motivated by their positive impact on text-related scoring tasks. Specifically, the speech encoder first maps the learner's speech into contextual features. The adapter layer then transforms these features to align with the text embedding in latent space. The assessment task-specific prefix and prompt text are embedded and concatenated with the features generated by the modality adapter layer, enabling the LLMs to predict accuracy and fluency scores. Our experiments demonstrate that the proposed scoring systems achieve competitive results compared to the baselines on the…
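The pipeline described in the abstract (speech encoder, modality adapter, concatenation with embedded prefix and prompt text) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the adapter is assumed here to be a single linear projection, the concatenation order (prefix, speech, prompt) is an assumption, and all dimensions, names, and random values are hypothetical stand-ins for real encoder outputs and LLM embeddings.

```python
import random

def adapter(frames, weight, bias):
    """Project each speech frame into the LLM embedding space.

    Assumption: the adapter is modeled as one linear layer, y = W x + b.
    """
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]

def build_llm_input(prefix_emb, speech_feats, prompt_emb, weight, bias):
    """Concatenate prefix, adapted speech features, and prompt embeddings.

    Assumption: the order prefix -> speech -> prompt is illustrative only.
    """
    return prefix_emb + adapter(speech_feats, weight, bias) + prompt_emb

random.seed(0)
ENC_DIM, LLM_DIM = 4, 3  # hypothetical encoder and LLM embedding sizes

# Stand-in for speech-encoder output: 6 frames of contextual features.
speech_feats = [[random.random() for _ in range(ENC_DIM)] for _ in range(6)]

# Hypothetical adapter parameters (randomly initialized here, learned in practice).
weight = [[random.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

# Stand-ins for the embedded task-specific prefix (2 tokens) and prompt text (3 tokens).
prefix_emb = [[0.1] * LLM_DIM for _ in range(2)]
prompt_emb = [[0.2] * LLM_DIM for _ in range(3)]

seq = build_llm_input(prefix_emb, speech_feats, prompt_emb, weight, bias)
print(len(seq), len(seq[0]))  # sequence length and embedding width fed to the LLM
```

The point of the sketch is only the shape bookkeeping: once the adapter maps speech frames to the same width as the text embeddings, the three pieces form a single sequence that the LLM can consume to predict accuracy and fluency scores.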

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Adapter · ALIGN