The NTNU System at the S&I Challenge 2025 SLA Open Track
Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen

TL;DR
This paper presents a system that fuses the scores of wav2vec 2.0 and the Phi-4 multimodal LLM for spoken language assessment, reaching an RMSE of 0.375 and outperforming the baseline models in the S&I Challenge 2025.
Contribution
It introduces a novel integration of acoustic and semantic models via score fusion to improve SLA accuracy.
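The summary does not state the exact fusion rule, so the following is only a minimal sketch of one common form of score fusion: a weighted average of the per-response scores from two models. The function name `fuse_scores`, the `weight` parameter, and the linear form itself are assumptions for illustration, not the authors' method.

```python
from typing import List, Sequence


def fuse_scores(
    acoustic: Sequence[float],
    semantic: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Linearly interpolate two lists of proficiency scores.

    `acoustic` stands in for scores from an acoustic model (e.g. W2V-based),
    `semantic` for scores from a multimodal LLM. `weight` is the share given
    to the acoustic scores; it is a hypothetical hyperparameter that would
    normally be tuned on a development set.
    """
    if len(acoustic) != len(semantic):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * s for a, s in zip(acoustic, semantic)]
```

With `weight=1.0` the fused score reduces to the acoustic score alone, and with `weight=0.0` to the semantic score alone, which makes it easy to compare the fused system against each single-model system.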
Findings
Achieved RMSE of 0.375, ranking second in the challenge.
Outperformed the baseline systems, whose RMSE scores were higher (worse).
Demonstrated the effectiveness of multimodal fusion in SLA tasks.
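Because the findings above are reported as RMSE, where lower is better, a short sketch of the metric may help when reading them. The function name `rmse` and the sample values in the usage note are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between predicted and reference scores."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    if not predicted:
        raise ValueError("at least one score is required")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmse([1.0, 3.0], [2.0, 2.0])` gives 1.0, since both predictions are off by exactly one point. A system that matches every reference score exactly has an RMSE of 0.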
Abstract
A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the…
Taxonomy
Methods: Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Dropout
