DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
Li Li, Ming Cheng, Weixin Zhu, Yannan Wang, Juan Liu, Ming Li

TL;DR
DM-ASR leverages diarization cues and large language models to improve multi-speaker speech recognition by reformulating it as a structured dialogue generation task.
Contribution
It introduces a novel diarization-aware framework that decouples speaker-temporal structure from linguistic content using large language models.
Findings
Achieves strong performance on Mandarin and English benchmarks.
Outperforms existing unified multi-speaker ASR approaches.
Enables richer structured outputs with optional timestamp prediction.
Abstract
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
