The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge

Hongfei Xue; Kaixun Huang; Zhikai Zhou; Shen Huang; Shidong Shang

arXiv:2507.18051·cs.SD·July 25, 2025

TL;DR

This paper describes the TEA-ASLP system for multilingual conversational speech recognition and speech diarization, which placed first in Task I and second in Task II of the MLC-SLM 2025 Challenge.

Contribution

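The paper enhances the Ideal-LLM model with known language identification, a multilingual MoE LoRA structure, and CTC-predicted tokens used as prompts for autoregressive generation, and it replaces the baseline English-Chinese speaker diarization model with an English-only one.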

Findings

01  30.8% reduction in WER compared to the baseline speech language model

02  Final WER of 9.60% in Task I

03  Second place in Task II (speech diarization ASR), with a time-constrained minimum-permutation WER of 17.49%
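For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. It is not the challenge's scoring tool. The function names and the example numbers are hypothetical. The abstract does not say whether the 30.8% figure is a relative or an absolute reduction, so `relative_reduction` only shows the relative form, which is the common convention.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative reduction of improved_wer with respect to baseline_wer."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical example values, not taken from the paper.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = relative_reduction(20.0, 15.0)
```

Here the hypothesis has one substitution and one deletion against six reference words. The time-constrained minimum-permutation WER reported for Task II additionally accounts for speaker assignment and timing, which this sketch does not model.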

Abstract

This paper presents the TEA-ASLP system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. For Task I, we enhance the Ideal-LLM model by integrating known language identification and a multilingual MoE LoRA structure, along with using CTC-predicted tokens as prompts to improve autoregressive generation. The model is trained on approximately 180k hours of multilingual ASR data. In Task II, we replace the baseline English-Chinese speaker diarization model with a more suitable English-only version. Our approach achieves a 30.8% reduction in word error rate (WER) compared to the baseline speech language model, resulting in a final WER of 9.60% in Task I and a time-constrained minimum-permutation WER of 17.49% in Task II, earning first and second place in the respective challenge…
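The idea of using CTC-predicted tokens as prompts can be illustrated with a minimal sketch, under stated assumptions. The abstract does not describe how the tokens are obtained or formatted, so this example assumes plain greedy CTC decoding (merge consecutive repeats, then drop blanks) and a made-up prompt layout. The names `ctc_greedy_collapse` and `build_prompt`, the `blank_id` value, and the `<|lang|>` tag format are all hypothetical and are not taken from the paper.

```python
from itertools import groupby
from typing import Dict, List


def ctc_greedy_collapse(frame_ids: List[int], blank_id: int = 0) -> List[int]:
    """Collapse per-frame argmax ids: merge consecutive repeats, then remove blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


def build_prompt(language: str, token_ids: List[int], vocab: Dict[int, str]) -> str:
    """Hypothetical prompt layout: a language tag followed by the CTC draft text."""
    draft = " ".join(vocab[token_id] for token_id in token_ids)
    return f"<|{language}|> {draft}"


# Toy frame-level predictions; 0 is assumed to be the CTC blank.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
vocab = {3: "hello", 5: "world"}

tokens = ctc_greedy_collapse(frames)
prompt = build_prompt("en", tokens, vocab)
```

The blank between the two runs of token 3 is what lets the collapse keep both occurrences. In a real system the draft would be passed to the language model alongside the speech features, and the model would be free to correct it during autoregressive decoding.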

Taxonomy

Topics: Speech Recognition and Synthesis