Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach

Tae Jin Park; Kunal Dhawan; Nithin Koluguri; Jagadeesh Balam

arXiv:2309.05248 · eess.AS · September 15, 2023

TL;DR

This paper introduces a speaker diarization method that combines a large language model with an acoustic diarization system through joint beam search, using contextual and lexical cues to improve speaker-attributed word error rate (SA-WER) by up to 39.8% relative.

Contribution

It presents a multi-modal decoding approach that combines acoustic information with lexical information from an LLM to improve speaker diarization performance.

Findings

01. Up to 39.8% relative improvement in speaker-attributed word error rate.

02. Demonstrates that LLMs provide valuable contextual cues beyond acoustic models.

03. Shows potential for LLMs to benefit other speech processing tasks.
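
For readers unfamiliar with the term, "relative improvement" is normally read as the reduction in error rate divided by the baseline error rate. The formula below is that standard definition, not a quotation from the paper; the subscripts "baseline" (acoustics-only system) and "proposed" (system with LLM cues) are labels assumed here for illustration.

```latex
\[
\text{relative improvement}
  = \frac{\text{SA-WER}_{\text{baseline}} - \text{SA-WER}_{\text{proposed}}}
         {\text{SA-WER}_{\text{baseline}}} \times 100\%
\]
```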

Abstract

Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual cues in human dialogues. Our method builds upon an acoustic-based speaker diarization system by adding lexical information from an LLM in the inference stage. We model the multi-modal decoding process probabilistically and perform joint acoustic and lexical beam search to incorporate cues from both modalities: audio and text. Our experiments demonstrate that infusing lexical knowledge from the LLM into an acoustics-only diarization system improves overall speaker-attributed word error rate (SA-WER). The experimental results show that LLMs can provide complementary information to acoustic models for the speaker diarization task via proposed beam search…
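
The joint acoustic and lexical beam search described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes each word comes with a per-speaker acoustic probability, that a lexical model returns a log-probability for assigning a candidate speaker to the next word, and that the two scores are combined log-linearly. The names `joint_beam_search`, `toy_lexical_logprob`, `beam_width`, and `lexical_weight` are hypothetical, and the toy lexical scorer is a hand-written stand-in for an LLM.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature (an assumption, not the paper's API):
# (previous words, previous speakers, next word, candidate speaker) -> log-prob
LexicalScorer = Callable[[Sequence[str], Sequence[int], str, int], float]


def joint_beam_search(
    words: Sequence[str],
    acoustic_probs: Sequence[Sequence[float]],
    lexical_logprob: LexicalScorer,
    beam_width: int = 4,
    lexical_weight: float = 0.5,
) -> List[int]:
    """Return the highest-scoring speaker index per word.

    Each hypothesis is scored by the sum over words of
    log p_acoustic + lexical_weight * log p_lexical (assumed combination).
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]
    for i, word in enumerate(words):
        candidates: List[Tuple[float, List[int]]] = []
        for score, speakers in beams:
            for spk, p_ac in enumerate(acoustic_probs[i]):
                if p_ac <= 0.0:
                    continue  # skip speakers the acoustic model rules out
                lex = lexical_logprob(words[:i], speakers, word, spk)
                new_score = score + math.log(p_ac) + lexical_weight * lex
                candidates.append((new_score, speakers + [spk]))
        # Keep only the best `beam_width` partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


def toy_lexical_logprob(
    prev_words: Sequence[str],
    prev_speakers: Sequence[int],
    word: str,
    speaker: int,
) -> float:
    """Toy stand-in for an LLM: a speaker change is likely right after a
    question and unlikely otherwise. The probabilities are made up."""
    if not prev_speakers:
        return math.log(0.5)
    changed = speaker != prev_speakers[-1]
    p_change = 0.9 if prev_words[-1].endswith("?") else 0.1
    return math.log(p_change if changed else 1.0 - p_change)


# Illustrative input: the acoustic model is unsure who says "fine".
words = ["how", "are", "you?", "fine", "thanks"]
acoustic_probs = [[0.9, 0.1], [0.9, 0.1], [0.8, 0.2], [0.55, 0.45], [0.2, 0.8]]

print(joint_beam_search(words, acoustic_probs, toy_lexical_logprob, lexical_weight=0.0))
print(joint_beam_search(words, acoustic_probs, toy_lexical_logprob, lexical_weight=1.0))
```

With `lexical_weight=0.0` the search reduces to the acoustic decision and assigns "fine" to speaker 0. With `lexical_weight=1.0` the toy lexical cue (a turn change after a question) moves "fine" to speaker 1, which illustrates how text can correct an uncertain acoustic label. In a real system the lexical scores would come from a language model rather than a hand-written rule.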

Code & Models

Repositories

tango4j/llm_speaker_tagging

Models

GenSEC-LLM/SLT-Task2-ngram-baseline

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing