Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models
Yuke Lin, Ming Cheng, Ze Li, Beilong Tang, Ming Li

TL;DR
This paper introduces a diarization-aware multi-speaker ASR system that integrates speaker diarization with large language models. It improves transcription of overlapped multi-speaker speech by giving the LLM structured diarization inputs together with frame-level speaker and semantic embeddings.
Contribution
The paper presents a novel LLM-based framework that combines speaker diarization with transcription. It addresses a limitation of earlier SOT-style methods, which often discard absolute timing information and are therefore less useful in time-sensitive applications.
Findings
Achieves robust performance in multilingual dyadic conversations.
Excels in complex, high-overlap multi-speaker meeting scenarios.
Demonstrates the potential of LLMs for joint speaker segmentation and transcription.
Abstract
Multi-speaker automatic speech recognition (MS-ASR) faces significant challenges in transcribing overlapped speech, a task critical for applications like meeting transcription and conversational analysis. While serialized output training (SOT)-style methods serve as common solutions, they often discard absolute timing information, limiting their utility in time-sensitive scenarios. Leveraging recent advances in large language models (LLMs) for conversational audio processing, we propose a novel diarization-aware multi-speaker ASR system that integrates speaker diarization with LLM-based transcription. Our framework processes structured diarization inputs alongside frame-level speaker and semantic embeddings, enabling the LLM to generate segment-level transcriptions. Experiments demonstrate that the system achieves robust performance in multilingual dyadic conversations and excels in…
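To make "structured diarization inputs" concrete, here is a minimal sketch of how diarization segments might be serialized into text that keeps absolute timing, which is the information SOT-style outputs tend to drop. The `Segment` class, the bracketed timestamp format, and both function names are hypothetical illustrations, not the format used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "spk1"
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds


def format_diarization_prompt(segments):
    """Render one line per segment, ordered by start time (assumed layout)."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return "\n".join(f"[{s.start:.2f}-{s.end:.2f}] {s.speaker}:" for s in ordered)


def overlapped_pairs(segments):
    """Return pairs of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so no later segment can overlap `a`
            if a.speaker != b.speaker:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    segs = [Segment("spk2", 1.5, 3.0), Segment("spk1", 0.0, 2.0)]
    print(format_diarization_prompt(segs))
    print(len(overlapped_pairs(segs)), "overlapping pair(s)")
```

Because each line carries its own start and end time, overlapping segments from different speakers can sit side by side without losing when each one occurred; a segment-level transcription would then be attached to each line.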
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
