A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations

Phurich Saengthong; Boonnithi Jiaramaneepinit; Sheng Li; Manabu Okumura; Takahiro Shinozaki

arXiv:2507.02927 · cs.CL · July 8, 2025

TL;DR

This paper introduces a unified speech LLM that jointly performs diarization and speech recognition in multilingual conversations, achieving a 54.87% relative improvement in tcpWER/tcpCER over the baseline.

Contribution

An end-to-end model for joint diarization and ASR that reformulates the training data format and modifies the inference procedure to address the ambiguity inherent in pre-segmented audio.

Findings

01. Achieved a 54.87% relative improvement in tcpWER/tcpCER over the baseline.

02. Ranked 8th overall in the MLC-SLM Challenge.

03. Demonstrated effectiveness even with a smaller LLM backbone.
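
A relative improvement such as the 54.87% figure in the first finding is conventionally computed as the reduction in error rate divided by the baseline error rate. The sketch below shows that arithmetic under the assumption that this usual definition applies; the function name and the example error rates are illustrative placeholders, not values reported in the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Relative error reduction, in percent, of a system over a baseline.

    Assumes the common definition (baseline - system) / baseline * 100.
    Both arguments are error rates on the same scale (e.g. tcpWER in percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error * 100.0


# Placeholder numbers for illustration only; they are NOT the paper's results.
example = relative_improvement(baseline_error=50.0, system_error=25.0)
print(f"{example:.2f}% relative improvement")
```

Because the metric is a ratio, the same relative improvement can correspond to very different absolute error rates, so it is best read alongside the underlying tcpWER/tcpCER values.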

Abstract

Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented…
