SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

Yuxiang Mei; Yuang Zheng; Dongxing Xu; Yanhua Long

arXiv:2507.03343·cs.CL·July 9, 2025

TL;DR

This paper presents a multilingual conversational speech recognition system that combines parallel speech encoders with a large language model, achieving competitive performance in the INTERSPEECH 2025 MLC-SLM Challenge.

Contribution

The novel integration of parallel pre-trained speech encoders with an LLM and a tri-stage training strategy enhances multilingual ASR performance without additional data.

Findings

01 Achieved 11.76% CER/WER on the challenge dataset

02 Outperformed the baseline by 8.41 absolute CER/WER

03 Made effective use of language-aware prompts

Abstract

This paper describes the SHNU multilingual conversational speech recognition system (SHNU-mASR, team name "maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel speech encoder consists of two pre-trained encoders: the Whisper-large-v3 encoder and the mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. We also incorporate a language-aware prompt at the LLM input to enhance language-specific text…
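The data flow described in the abstract (two encoder outputs concatenated, projected, and placed at the LLM input alongside a language-aware prompt) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say whether concatenation is along the feature or the time axis (feature-wise is assumed here), the projector is assumed to be a single linear layer, the prompt is assumed to be prepended, and all names (`fuse_encoder_outputs`, `build_llm_input`), dimensions, and the random stand-in features are hypothetical.

```python
import numpy as np


def fuse_encoder_outputs(feats_a, feats_b, proj_weight, proj_bias):
    """Concatenate two encoders' frame embeddings and project them.

    Assumptions (not from the paper): feature-wise concatenation,
    truncation to the shorter sequence, and a single linear projector.
    """
    # Align the two sequences by truncating to the shared frame count.
    n_frames = min(feats_a.shape[0], feats_b.shape[0])
    fused = np.concatenate([feats_a[:n_frames], feats_b[:n_frames]], axis=-1)
    # Linear projection into the (hypothetical) LLM embedding width.
    return fused @ proj_weight + proj_bias


def build_llm_input(prompt_embeds, speech_embeds):
    """Prepend prompt embeddings to speech embeddings (placement is assumed)."""
    return np.concatenate([prompt_embeds, speech_embeds], axis=0)


rng = np.random.default_rng(0)

# Random stand-ins for encoder outputs: (frames, feature_dim).
# The dimensions are illustrative only.
feats_a = rng.standard_normal((100, 1280))  # stand-in for encoder 1 output
feats_b = rng.standard_normal((99, 768))    # stand-in for encoder 2 output

llm_dim = 64  # toy width; a real LLM embedding size would be far larger
proj_weight = rng.standard_normal((1280 + 768, llm_dim)) * 0.01
proj_bias = np.zeros(llm_dim)

speech_embeds = fuse_encoder_outputs(feats_a, feats_b, proj_weight, proj_bias)

# Stand-in for an embedded language-aware prompt of 4 tokens.
prompt_embeds = rng.standard_normal((4, llm_dim))

llm_input = build_llm_input(prompt_embeds, speech_embeds)
print(speech_embeds.shape, llm_input.shape)
```

In this toy setup the fused speech sequence keeps the shorter encoder's frame count, and the prompt tokens simply extend the sequence that the LLM would consume. The low-rank adaptation modules and the tri-stage training schedule mentioned in the abstract are outside the scope of this sketch.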

Taxonomy

Topics: Speech Recognition and Synthesis