The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge

Yuke Lin; Ming Cheng; Ze Li; Ming Li

arXiv:2507.09499 · eess.AS · July 15, 2025

TL;DR

The paper describes a multi-speaker speech recognition system that feeds diarization output (speaker embeddings and utterance boundaries) into a Qwen2.5-based LLM, substantially outperforming the official baseline on the MLC-SLM dataset without relying on oracle speaker labels or time boundaries.

Contribution

It builds on a diarization-aware LLM framework and adds multilingual fine-tuning through language-specific adapters and LoRA modules in the LLM decoder, reaching a tcpWER of 18.08% on the MLC-SLM test set.

Findings

01 Achieved a tcpWER of 23.56% on the development set

02 Achieved a tcpWER of 18.08% on the test set

03 Substantially outperformed the official baseline

Abstract

We present the DKU system for Task 2 of the MLC-SLM Challenge, which aims to perform multi-speaker automatic speech recognition directly from raw audio without oracle speaker labels or time boundaries. Our approach builds upon a diarization-aware framework that integrates speaker embeddings and temporal utterance boundaries into a Qwen2.5-based large language model (LLM). We then enhance the system's multilingual performance by fine-tuning language-specific adapters and LoRA modules within the LLM decoder. Our system achieves tcpWERs of 23.56% and 18.08% on the development and test sets of the MLC-SLM dataset, respectively, substantially outperforming the official baseline.
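
For readers unfamiliar with the metric reported above: tcpWER is a time-constrained variant of the concatenated minimum-permutation word error rate (cpWER). The sketch below illustrates only the simpler cpWER idea and deliberately omits the time constraint, so it is not the challenge's scoring procedure and will not reproduce the numbers in the abstract. The function names `edit_distance` and `cp_wer` are hypothetical, and the input format (one concatenated transcript string per speaker) is an assumption made for illustration.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Simplified cpWER: try every speaker assignment, keep the lowest error.

    Each argument is a list with one concatenated transcript per speaker.
    Assumption: no timing information is used, unlike tcpWER.
    """
    refs = [text.split() for text in ref_by_speaker]
    hyps = [text.split() for text in hyp_by_speaker]

    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Brute force over permutations; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

Because the hypothesis speaker labels are arbitrary, the score is taken over the best mapping between hypothesis and reference speakers. A system that transcribes every word correctly but swaps the two speaker labels still scores 0, while words attributed to the wrong speaker count as errors.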

Taxonomy

Topics: Speech Recognition and Synthesis