SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge
Yuxiang Mei, Yuang Zheng, Dongxing Xu, Yanhua Long

TL;DR
This paper presents a multilingual conversational speech recognition system that combines parallel speech encoders with a large language model, reaching 11.76% CER/WER on the INTERSPEECH 2025 MLC-SLM Challenge dataset.
Contribution
The novel integration of parallel pre-trained speech encoders with an LLM and a tri-stage training strategy enhances multilingual ASR performance without additional data.
Findings
Achieved 11.76% CER/WER on the challenge dataset
Outperformed the baseline by 8.41 percentage points absolute CER/WER
Effective use of language-aware prompts
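For readers unfamiliar with the metric, CER and WER are edit-distance-based error rates computed over characters and words respectively. The following is a minimal sketch in Python, assuming the standard Levenshtein definition; the function names are illustrative, and the challenge's actual scoring script and text normalization are not described here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored (an assumption)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

Taken together, the two reported figures imply a baseline error rate of roughly 11.76 + 8.41 = 20.17% CER/WER.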
Abstract
This paper describes the SHNU multilingual conversational speech recognition system (SHNU-mASR, team name: "maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel speech encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and the mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. We also incorporate a language-aware prompt at the LLM input to enhance language-specific text…
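The data flow described in the abstract can be illustrated with a toy sketch in plain Python. It is not the authors' implementation. It assumes the two encoder outputs are aligned frame by frame and concatenated along the feature axis, then mapped to the LLM embedding size by a single linear projector. The dimensions, the random stand-in embeddings, the helper names, and the prompt wording are all hypothetical.

```python
import random


def concat_frames(frames_a, frames_b):
    """Concatenate two frame sequences along the feature axis.

    Assumption: both encoders emit frames at the same rate; any length
    mismatch is handled by truncating to the shorter sequence.
    """
    n = min(len(frames_a), len(frames_b))
    return [frames_a[t] + frames_b[t] for t in range(n)]


def project(frames, weight):
    """Linear projector: weight has shape (out_dim, in_dim), no bias."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


def language_prompt(language):
    """Hypothetical language-aware prompt; the real wording is not given."""
    return f"Transcribe the following {language} speech."


random.seed(0)
T, DIM_A, DIM_B, DIM_LLM = 4, 3, 2, 5  # toy sizes, not the real model dims

# Random stand-ins for the outputs of the two pre-trained encoders.
enc_a = [[random.random() for _ in range(DIM_A)] for _ in range(T)]
enc_b = [[random.random() for _ in range(DIM_B)] for _ in range(T)]
weight = [[random.random() for _ in range(DIM_A + DIM_B)]
          for _ in range(DIM_LLM)]

fused = concat_frames(enc_a, enc_b)          # T x (DIM_A + DIM_B)
speech_embeddings = project(fused, weight)   # T x DIM_LLM
prompt = language_prompt("English")          # text side of the LLM input
```

In a real system the projected speech embeddings would be combined with the embedded prompt tokens before being passed to the LLM. How the authors order and combine them is not stated in the truncated abstract.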
Taxonomy
Topics: Speech Recognition and Synthesis
