Multimodal Clustering with Role Induced Constraints for Speaker Diarization
Nikolaos Flemotomos, Shrikanth Narayanan

TL;DR
This paper introduces a multimodal speaker diarization approach that leverages role-based constraints derived from text to enhance clustering accuracy in conversational settings.
Contribution
It proposes a novel method combining text-based role extraction with audio spectral clustering to improve speaker diarization performance.
Findings
Improved clustering accuracy over audio-only methods.
Effective application in medical and podcast domains.
Role constraints guide spectral clustering successfully.
Abstract
Speaker clustering is an essential step in conventional speaker diarization systems and is typically addressed as an audio-only speech processing task. The language used by the participants in a conversation, however, carries additional information that can help improve the clustering performance. This is especially true in conversational interactions, such as business meetings, interviews, and lectures, where specific roles assumed by interlocutors (manager, client, teacher, etc.) are often associated with distinguishable linguistic patterns. In this paper we propose to employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method is applied on two different domains, namely on medical interactions and on podcast episodes,…
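The constraint idea in the abstract can be illustrated with a minimal sketch. It assumes each role is held by exactly one speaker, so two segments with the same confidently predicted role get a must-link and two with different roles get a cannot-link. The function names, the confidence threshold, the role labels, and the choice to overwrite affinity entries with 1 or 0 are all assumptions made for illustration; they are not taken from the paper, which may impose the constraints differently.

```python
from itertools import combinations


def role_constraints(roles, confidences, threshold=0.9):
    """Return (must_link, cannot_link) sets of index pairs (i, j) with i < j.

    Only segments whose role prediction reaches `threshold` contribute
    constraints. This gating is an assumption of the sketch.
    """
    trusted = [i for i, c in enumerate(confidences) if c >= threshold]
    must_link, cannot_link = set(), set()
    for i, j in combinations(trusted, 2):
        if roles[i] == roles[j]:
            must_link.add((i, j))    # same role: assumed same speaker
        else:
            cannot_link.add((i, j))  # different roles: assumed different speakers
    return must_link, cannot_link


def apply_constraints(affinity, must_link, cannot_link):
    """Copy a symmetric affinity matrix and overwrite the constrained entries."""
    adjusted = [row[:] for row in affinity]
    for i, j in must_link:
        adjusted[i][j] = adjusted[j][i] = 1.0
    for i, j in cannot_link:
        adjusted[i][j] = adjusted[j][i] = 0.0
    return adjusted


# Toy input: four segments with hypothetical role predictions and
# a hypothetical audio-based affinity matrix.
roles = ["clinician", "patient", "clinician", "patient"]
confidences = [0.95, 0.97, 0.92, 0.40]
affinity = [
    [1.0, 0.6, 0.5, 0.3],
    [0.6, 1.0, 0.4, 0.7],
    [0.5, 0.4, 1.0, 0.2],
    [0.3, 0.7, 0.2, 1.0],
]

must_link, cannot_link = role_constraints(roles, confidences)
adjusted = apply_constraints(affinity, must_link, cannot_link)
```

The last segment falls below the threshold, so its affinities are left to the audio evidence alone. The adjusted matrix would then be handed to any spectral clustering routine that accepts a precomputed affinity matrix.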
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Music and Audio Processing
Methods: Spectral Clustering
