SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Han Yin; Yafeng Chen; Chong Deng; Luyao Cheng; Hui Wang; Chao-Hong Tan; Qian Chen; Wen Wang; Xiangang Li

arXiv:2508.06372·cs.SD·January 6, 2026

TL;DR

SpeakerLM is a unified multimodal large language model that jointly performs speaker diarization and recognition end-to-end, overcoming limitations of cascaded systems and demonstrating superior performance and robustness across diverse scenarios.

Contribution

The paper introduces SpeakerLM, an end-to-end multimodal large language model for speaker diarization and recognition (SDR). It adds a flexible speaker registration mechanism and uses multi-stage training, moving beyond traditional cascaded approaches.

Findings

01. Outperforms state-of-the-art cascaded baselines on public SDR benchmarks.

02. Demonstrates strong data scaling capability and generalizability.

03. Ensures robust SDR performance across diverse registration conditions.

Abstract

The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM,…
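
To make the "who spoke when and what" target concrete, the following is a minimal sketch of how an SDR result could be represented and rendered. It is not SpeakerLM's actual output format: the `Segment` type, its field names, the speaker labels, and both helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One SDR unit: who (speaker), when (start/end), and what (text)."""
    speaker: str   # hypothetical speaker label, e.g. "spk1"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcribed words for this segment


def to_transcript(segments: List[Segment]) -> List[str]:
    """Render segments in time order as 'speaker [start-end]: text' lines."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return [f"{s.speaker} [{s.start:.2f}-{s.end:.2f}]: {s.text}" for s in ordered]


def overlapping_pairs(segments: List[Segment]) -> List[Tuple[int, int]]:
    """Return index pairs of segments from different speakers that overlap in time."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            a, b = segments[i], segments[j]
            # Two intervals overlap when each one starts before the other ends.
            if a.speaker != b.speaker and a.start < b.end and b.start < a.end:
                pairs.append((i, j))
    return pairs


# Illustrative data only; not taken from the paper.
example = [
    Segment("spk1", 0.0, 2.5, "hello everyone"),
    Segment("spk2", 2.0, 4.0, "hi there"),
    Segment("spk1", 4.0, 5.0, "let's start"),
]
```

Calling `to_transcript(example)` gives one line per segment, and `overlapping_pairs(example)` flags the first two segments, whose time ranges intersect. Overlapping speech of this kind is one of the cases the abstract says cascaded SD + ASR pipelines handle poorly.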

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques