Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam

TL;DR
This paper introduces a speaker diarization method that integrates a large language model with an acoustic diarization system via joint beam search, using contextual and lexical cues to improve speaker-attributed word error rate (SA-WER) by up to 39.8% relative.
Contribution
It presents a multi-modal decoding approach that combines acoustic information from a diarization system with lexical information from an LLM at inference time to improve speaker diarization performance.
Findings
Up to 39.8% relative improvement in speaker-attributed word error rate.
Demonstrates that LLMs provide valuable contextual cues beyond acoustic models.
Shows potential for LLMs to benefit other speech processing tasks.
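For readers unfamiliar with the term, a "relative improvement" in an error rate is the reduction expressed as a fraction of the baseline error, not a difference in percentage points. The sketch below shows the arithmetic only. The function name and the two error values are hypothetical illustrations and are not results reported in the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction, in percent of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates chosen only to illustrate the formula;
# they are NOT taken from the paper.
example = relative_improvement(baseline_error=10.0, new_error=6.02)
print(f"{example:.1f}% relative improvement")
```

Under these made-up values, an absolute drop of 3.98 points corresponds to a 39.8% relative improvement, which shows why relative figures look larger than absolute ones when the baseline error is small.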
Abstract
Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual cues in human dialogues. Our method builds upon an acoustic-based speaker diarization system by adding lexical information from an LLM in the inference stage. We model the multi-modal decoding process probabilistically and perform joint acoustic and lexical beam search to incorporate cues from both modalities: audio and text. Our experiments demonstrate that infusing lexical knowledge from the LLM into an acoustics-only diarization system improves overall speaker-attributed word error rate (SA-WER). The experimental results show that LLMs can provide complementary information to acoustic models for the speaker diarization task via proposed beam search…
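The joint acoustic and lexical beam search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (acoustic log-probability plus a weighted lexical log-probability), the `lexical_weight` parameter, and the `toy_lexical_prob` function standing in for an LLM are all assumptions made for illustration. The idea is that each word receives a speaker label, and hypotheses are ranked by evidence from both modalities.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[int, ...]]  # (log score, speaker label per word)


def joint_beam_search(
    words: Sequence[str],
    acoustic_probs: Sequence[Sequence[float]],
    lexical_prob: Callable[[Sequence[str], Sequence[int], int], float],
    beam_width: int = 4,
    lexical_weight: float = 0.5,
) -> List[int]:
    """Assign a speaker to each word by combining acoustic and lexical scores.

    acoustic_probs[i][s] is an assumed per-word speaker probability from an
    acoustic diarizer; lexical_prob is a hypothetical stand-in for an LLM.
    """
    num_speakers = len(acoustic_probs[0])
    beams: List[Hypothesis] = [(0.0, ())]
    for i in range(len(words)):
        candidates: List[Hypothesis] = []
        for score, labels in beams:
            for spk in range(num_speakers):
                p_ac = max(acoustic_probs[i][spk], 1e-12)
                p_lex = max(lexical_prob(words[: i + 1], labels, spk), 1e-12)
                new_score = score + math.log(p_ac) + lexical_weight * math.log(p_lex)
                candidates.append((new_score, labels + (spk,)))
        # Keep only the highest-scoring hypotheses.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams = candidates[:beam_width]
    return list(beams[0][1])


def toy_lexical_prob(words: Sequence[str], prev_labels: Sequence[int], spk: int) -> float:
    """Toy heuristic, NOT an LLM: a speaker change is likely right after a question."""
    if not prev_labels:
        return 0.5
    prev_word = words[len(prev_labels) - 1]
    same_speaker = spk == prev_labels[-1]
    if prev_word.endswith("?"):
        return 0.2 if same_speaker else 0.8
    return 0.8 if same_speaker else 0.2


# Made-up two-speaker example: the acoustic scores are ambiguous at "fine".
words = ["how", "are", "you?", "fine", "thanks"]
acoustic_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.55, 0.45], [0.2, 0.8]]

acoustic_only = joint_beam_search(words, acoustic_probs, toy_lexical_prob, lexical_weight=0.0)
joint = joint_beam_search(words, acoustic_probs, toy_lexical_prob)
print("acoustic only:", acoustic_only)
print("joint:        ", joint)
```

In this toy example the acoustic scores alone slightly favor keeping "fine" with the first speaker, while the lexical cue (a question was just asked) moves the speaker change to the start of the answer. A real system would replace `toy_lexical_prob` with probabilities derived from an LLM and would tune the modality weight on held-out data.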
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing
