Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning

Yu-Hsuan Fang; Tien-Hong Lo; Yao-Ting Sung; Berlin Chen

arXiv:2508.12591·cs.CL·August 19, 2025



TL;DR

This paper explores the use of Multimodal Large Language Models for comprehensive automated speaking assessment, introducing a curriculum learning strategy to improve delivery evaluation and outperform existing methods.

Contribution

The paper presents the first systematic study of MLLMs for ASA and proposes Speech-First Multimodal Training (SFMT) to strengthen speech modeling and improve assessment accuracy.
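The speech-first idea can be pictured as a two-stage schedule: early epochs see speech alone, and later epochs add text. The sketch below is only an illustration under stated assumptions. The function name `modalities_for_epoch`, the parameter `speech_only_epochs`, and the choice of an audio-only first stage are hypothetical; this listing does not specify stage lengths or the exact inputs used in each stage.

```python
def modalities_for_epoch(epoch, speech_only_epochs):
    """Return the input modalities for a 0-indexed training epoch.

    Assumption (not taken from the paper): the first `speech_only_epochs`
    epochs use audio alone, and every later epoch uses audio plus text.
    """
    if epoch < 0 or speech_only_epochs < 0:
        raise ValueError("epoch and speech_only_epochs must be non-negative")
    if epoch < speech_only_epochs:
        return ("audio",)          # stage 1: speech modeling foundation
    return ("audio", "text")       # stage 2: cross-modal fusion


# Hypothetical 4-epoch run with a 2-epoch speech-only stage.
schedule = [modalities_for_epoch(e, speech_only_epochs=2) for e in range(4)]
```

A step function like this keeps the curriculum explicit and easy to vary. The right stage length would have to be tuned on held-out data.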

Findings

01

MLLM-based systems raise the Pearson correlation coefficient (PCC) for holistic assessment from 0.783 to 0.846.

02

SFMT enhances delivery aspect evaluation with a 4% accuracy gain.

03

The approach offers a new avenue for multimodal automated speaking assessment.
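For readers unfamiliar with the PCC figures above, the following minimal sketch shows how a Pearson correlation between system scores and human scores is computed. The score lists are made-up illustrative values, not data from the paper, and the function name `pearson_correlation` is an assumption of this example.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical holistic scores for five spoken responses.
human_scores = [2.0, 3.0, 3.5, 4.0, 5.0]
system_scores = [2.5, 3.0, 3.0, 4.5, 4.5]
pcc = pearson_correlation(system_scores, human_scores)
```

A PCC closer to 1 means the system's scores rise and fall with the human raters' scores more consistently. It measures agreement in trend, not exact score matches.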

Abstract

Traditional Automated Speaking Assessment (ASA) systems exhibit inherent modality limitations: text-based approaches lack acoustic information, while audio-based methods miss semantic context. Multimodal Large Language Models (MLLMs) offer unprecedented opportunities for comprehensive ASA by processing audio and text simultaneously within a unified framework. This paper presents the first systematic study of MLLMs for comprehensive ASA, demonstrating their superior performance across the aspects of content and language use. However, assessment of the delivery aspect reveals unique challenges that are deemed to require specialized training strategies. We thus propose Speech-First Multimodal Training (SFMT), which leverages a curriculum learning principle to establish a more robust foundation for speech modeling before cross-modal synergetic fusion. A series of experiments on a benchmark…

Taxonomy

Topics: Educational Technology and Assessment