NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, and Eng Siong Chng

arXiv:2506.13339 · cs.CL · July 8, 2025

TL;DR

This paper describes NTU Speechlab's multilingual ASR system for the Interspeech 2025 MLC-SLM Challenge (Task I). Using language-specific prompts and model averaging, the final system reduced the average Mix Error Rate on the evaluation set from 20.2% to 10.6%.

Contribution

The paper shows that language-specific prompts and model averaging improve multilingual speech recognition performance, and it analyzes the system's model architecture, data selection, and training strategies.
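As a rough illustration of what a language-specific prompt could look like, the sketch below selects an instruction string per language before it is passed to the LLM decoder. The prompt wording, the language codes, and the names `LANGUAGE_PROMPTS`, `GENERIC_PROMPT`, and `build_prompt` are all hypothetical; the page does not give the authors' actual templates.

```python
from typing import Optional

# Hypothetical prompt templates keyed by language code.
# The wording is an assumption for illustration only.
LANGUAGE_PROMPTS = {
    "en": "Transcribe the speech in English.",
    "ja": "Transcribe the speech in Japanese.",
    "de": "Transcribe the speech in German.",
}

# Fallback used when the language is unknown or not covered above.
GENERIC_PROMPT = "Transcribe the speech."


def build_prompt(language_code: Optional[str]) -> str:
    """Return a language-specific prompt, or the generic one as a fallback."""
    if language_code is None:
        return GENERIC_PROMPT
    return LANGUAGE_PROMPTS.get(language_code.lower(), GENERIC_PROMPT)
```

The fallback keeps the system usable for utterances whose language label is missing, at the cost of losing the language hint.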

Findings

01. Reduced Mix Error Rate from 20.2% to 10.6%.

02. Achieved 5th place in the challenge.

03. Demonstrated effectiveness of language prompts and model averaging.
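The reported reduction can be checked with a short calculation. The sketch below computes the absolute reduction in percentage points and the relative reduction in percent; the function name `error_rate_reduction` is illustrative and not taken from the paper.

```python
from typing import Tuple


def error_rate_reduction(baseline: float, final: float) -> Tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    absolute = baseline - final
    relative = absolute / baseline * 100.0
    return absolute, relative


# Mix Error Rate figures reported for the evaluation set.
absolute, relative = error_rate_reduction(20.2, 10.6)
print(f"{absolute:.1f} points absolute, {relative:.1f}% relative")
# prints "9.6 points absolute, 47.5% relative"
```

The relative figure of 47.5% matches the reported 48% once rounded to the nearest whole percent.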

Abstract

This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.
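The abstract does not specify how model averaging was performed. A common form is uniform averaging of parameters across several saved checkpoints, and the sketch below shows that variant only, under the assumption that each checkpoint is a mapping from parameter name to a flat list of floats. The names `Checkpoint` and `average_checkpoints` are hypothetical.

```python
from typing import Dict, List

# Assumed checkpoint layout: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Uniformly average parameters that share a name across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    reference_keys = set(checkpoints[0])
    for ckpt in checkpoints[1:]:
        if set(ckpt) != reference_keys:
            raise ValueError("checkpoints have different parameter names")

    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        tensors = [ckpt[name] for ckpt in checkpoints]
        length = len(tensors[0])
        if any(len(t) != length for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean over the checkpoints.
        averaged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return averaged


# Toy usage with two checkpoints holding a single two-element parameter.
merged = average_checkpoints([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}])
```

In practice the same element-wise mean would be applied to framework tensors rather than Python lists; the list form is used here only to keep the example dependency-free.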

Taxonomy

Topics: Speech Recognition and Synthesis