SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li

TL;DR
SpeakerLM is a unified multimodal large language model that jointly performs speaker diarization and recognition end-to-end, overcoming limitations of cascaded systems and demonstrating superior performance and robustness across diverse scenarios.
Contribution
The paper introduces SpeakerLM, a novel end-to-end multimodal large language model for speaker diarization and recognition (SDR), with a flexible speaker registration mechanism and multi-stage training, advancing beyond traditional cascaded approaches.
Findings
Outperforms state-of-the-art cascaded baselines on public SDR benchmarks.
Demonstrates strong data scaling capability and generalizability.
Ensures robust SDR performance across diverse registration conditions.
Abstract
The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, a crucial capability in real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). These cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and a lack of joint optimization for exploring the synergy between the SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM,…
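To make the cascaded framing concrete, here is a minimal sketch of how such a pipeline might combine the outputs of separate SD and ASR modules into "who spoke when and what". It is not SpeakerLM and not any specific baseline from the paper: the `Segment` and `Word` types, the function names, and the maximum-overlap assignment rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical diarization output: one speaker turn with timestamps (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass
class Word:
    """Hypothetical ASR output: one recognized word with timestamps (seconds)."""
    text: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, segments):
    """Label each word with the speaker whose segment overlaps it most.

    Words that overlap no segment are labeled "unknown". Any mistake made
    by the diarization module is inherited here unchanged.
    """
    labeled = []
    for w in words:
        best = max(
            segments,
            key=lambda s: overlap(w.start, w.end, s.start, s.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            labeled.append(("unknown", w.text))
        else:
            labeled.append((best.speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, text in labeled:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(text)
        else:
            turns.append([speaker, [text]])
    return [(speaker, " ".join(ws)) for speaker, ws in turns]


if __name__ == "__main__":
    segments = [Segment("spk1", 0.0, 2.0), Segment("spk2", 2.0, 4.0)]
    words = [Word("hello", 0.2, 0.6), Word("there", 0.7, 1.1), Word("hi", 2.5, 2.9)]
    for speaker, text in to_transcript(assign_speakers(words, segments)):
        print(f"{speaker}: {text}")
```

In this sketch the two modules never inform each other: a wrong segment boundary mislabels every word under it, and when two segments overlap in time a word can receive only one speaker label. These are the kinds of limitations the abstract attributes to cascaded systems.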
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
