Multimodal Clustering with Role Induced Constraints for Speaker Diarization

Nikolaos Flemotomos; Shrikanth Narayanan

arXiv:2204.00657 · eess.AS · July 12, 2022

Open Access

TL;DR

This paper introduces a multimodal speaker diarization approach that leverages role-based constraints derived from text to enhance clustering accuracy in conversational settings.

Contribution

It proposes a novel method combining text-based role extraction with audio spectral clustering to improve speaker diarization performance.
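As a rough illustration of how per-segment role predictions could be turned into pairwise constraints, the sketch below builds a constraint matrix from hypothetical role labels. It assumes each role is held by exactly one speaker (as in a two-party interview) and that only confidently classified segments should contribute constraints; the function name, the role names, and the 0.9 threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def role_constraints(roles, confidences, threshold=0.9):
    """Build a pairwise constraint matrix from per-segment role predictions.

    Z[i, j] = +1 means must-link, -1 means cannot-link, 0 means no constraint.
    Assumption (not from the paper): one speaker per role, and only segments
    whose role confidence reaches `threshold` produce constraints.
    """
    roles = np.asarray(roles)
    confident = np.asarray(confidences) >= threshold
    same_role = roles[:, None] == roles[None, :]
    both_confident = confident[:, None] & confident[None, :]
    Z = np.where(both_confident, np.where(same_role, 1, -1), 0)
    np.fill_diagonal(Z, 0)  # a segment needs no constraint with itself
    return Z

# Hypothetical output of a text-based role classifier for four segments.
roles = ["therapist", "client", "therapist", "client"]
confidences = [0.95, 0.97, 0.99, 0.60]
Z = role_constraints(roles, confidences)
print(Z)
```

The last segment falls below the threshold, so it receives no constraints and would be left entirely to the audio-based clustering.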

Findings

1. Improved clustering accuracy over audio-only methods.

2. Effective application in medical and podcast domains.

3. Role constraints guide spectral clustering successfully.

Abstract

Speaker clustering is an essential step in conventional speaker diarization systems and is typically addressed as an audio-only speech processing task. The language used by the participants in a conversation, however, carries additional information that can help improve the clustering performance. This is especially true in conversational interactions, such as business meetings, interviews, and lectures, where specific roles assumed by interlocutors (manager, client, teacher, etc.) are often associated with distinguishable linguistic patterns. In this paper we propose to employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method is applied on two different domains, namely on medical interactions and on podcast episodes,…
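To make the idea of guiding spectral clustering with must-link and cannot-link constraints concrete, here is a minimal sketch for the two-speaker case. It is not the paper's constraint-integration method: it simply overrides entries of a toy audio affinity matrix (must-link pairs set to 1, cannot-link pairs set to 0) and then splits the segments by the sign of the second eigenvector of the normalized Laplacian. All matrices, values, and function names are illustrative assumptions.

```python
import numpy as np

def apply_constraints(A, Z):
    """Override affinities: must-link (Z > 0) -> 1, cannot-link (Z < 0) -> 0."""
    A = A.copy()
    A[Z > 0] = 1.0
    A[Z < 0] = 0.0
    return A

def two_way_spectral(A):
    """Two-cluster spectral partition via the normalized graph Laplacian."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = vecs[:, 1] * d_inv_sqrt    # second-smallest eigenvector, rescaled
    return (fiedler > 0).astype(int)

# Toy audio affinities: segment 2 is acoustically ambiguous.
A = np.array([
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.5, 0.1],
    [0.5, 0.5, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
])

# Hypothetical role-induced constraints: segments 2 and 3 must link,
# segments 0 and 2 cannot link.
Z = np.zeros((4, 4), dtype=int)
Z[2, 3] = Z[3, 2] = 1
Z[0, 2] = Z[2, 0] = -1

labels = two_way_spectral(apply_constraints(A, Z))
print(labels)
```

With the constraints applied, segments 0 and 1 end up in one cluster and segments 2 and 3 in the other; which cluster receives label 0 or 1 is arbitrary, since the sign of an eigenvector is not fixed. Hard overrides like this are the simplest option; propagating the constraints to neighboring, unconstrained pairs is a common refinement.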

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Music and Audio Processing

Methods: Spectral Clustering