Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge
Xiaoxiao Li, An Zhu, Youhai Jiang, Fengjie Zhu

TL;DR
This paper describes a multilingual speech recognition system that combines frozen pretrained speech and language models with lightweight trainable modules, reaching a 9.83% WER/CER across 11 languages in the MLC-SLM 2025 Challenge.
Contribution
It presents an architecture for multilingual ASR that connects a frozen Whisper-large-v3 speech encoder to a frozen Qwen2.5-7B-Instruct LLM through a trainable Linear-ReLU-Linear adaptor, with trainable LoRA applied to the LLM.
Findings
Achieved a 9.83% WER/CER across 11 languages on the evaluation set
Ranked third among global participants in Track 1
Effective integration of pretrained models with task-specific adaptation
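For readers unfamiliar with the metric, the following is a minimal sketch of how a word or character error rate is commonly computed: the Levenshtein distance between reference and hypothesis tokens, divided by the reference length. The function names are illustrative, and the challenge's official scoring (text normalization, and which languages are scored by characters rather than words) is not described here, so this should not be read as the organizers' scoring script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters (spaces ignored here,
    which is an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
example_cer = cer("abc", "abd")
```

A combined WER/CER figure such as the one reported typically scores some languages at the word level and others at the character level, then aggregates the results.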
Abstract
This paper presents the architecture and performance of a novel Multilingual Automatic Speech Recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The proposed system comprises three key components: 1) a frozen Whisper-large-v3 based speech encoder, leveraging large-scale pretraining to ensure robust acoustic feature extraction; 2) a trainable adaptor module using a Linear-ReLU-Linear transformation to align speech and text representations; and 3) a frozen Qwen2.5-7B-Instruct large language model (LLM) integrated with trainable LoRA for contextual linguistic decoding. By systematically combining pretrained models with task-specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages in the evaluation set and ranked third among global participants.
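The adaptor in component 2 can be pictured as a two-layer projection applied to each encoder frame. The sketch below is a dependency-free toy illustration of a Linear-ReLU-Linear mapping from an encoder feature dimension to an LLM embedding dimension. The dimensions, initialization, and all function names are assumptions chosen for readability; the abstract does not give the team's hidden size, any frame downsampling, or training details, and a real implementation would use a tensor library rather than Python lists.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one feature vector (weight is out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (illustrative initialization only)."""
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def adaptor(frames, params):
    """Apply Linear -> ReLU -> Linear independently to every speech frame."""
    (w1, b1), (w2, b2) = params
    return [linear(relu(linear(frame, w1, b1)), w2, b2) for frame in frames]


rng = random.Random(0)
enc_dim, hidden_dim, llm_dim = 4, 8, 6  # toy sizes, not the real model widths
params = (init_linear(enc_dim, hidden_dim, rng),
          init_linear(hidden_dim, llm_dim, rng))

# Stand-in for frozen encoder output: 3 frames of enc_dim features each.
frames = [[rng.gauss(0.0, 1.0) for _ in range(enc_dim)] for _ in range(3)]
projected = adaptor(frames, params)  # 3 vectors of llm_dim features each
```

In a design of this kind, only the adaptor weights (and the LoRA matrices on the LLM) receive gradient updates, while the encoder and base LLM weights stay fixed, which keeps the number of trainable parameters small relative to the full models.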
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
