Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders
Phat Lam, Lam Pham, Truong Nguyen, Dat Ngo, Thinh Pham, Tin Nguyen, Loi Khanh Nguyen, Alexander Schindler

TL;DR
This paper introduces an unsupervised, multilingual speaker diarization system for telephone calls that leverages the pre-trained Whisper model for embeddings and a novel Mixture of Sparse Autoencoders for clustering, eliminating the need for large annotated datasets.
Contribution
The paper presents a new cluster-based diarization system that supports multiple languages and uses unsupervised learning with a novel autoencoder architecture, advancing multilingual and data-efficient speaker diarization.
Findings
Mix-SAE outperforms other autoencoder-based clustering methods.
The system achieves promising results on CALLHOME and CALLFRIEND datasets.
The system can be integrated into multi-task speech analysis applications.
Abstract
Existing speaker diarization systems typically rely on large amounts of manually annotated data, which is labor-intensive and difficult to obtain, especially in real-world scenarios. Additionally, language-specific constraints in these systems significantly hinder their effectiveness and scalability in multilingual settings. In this paper, we propose a cluster-based speaker diarization system designed for multilingual telephone call applications. Our proposed system supports multiple languages and eliminates the need for large-scale annotated data during training by utilizing the multilingual Whisper model to extract speaker embeddings. Additionally, we introduce a network architecture called Mixture of Sparse Autoencoders (Mix-SAE) for unsupervised speaker clustering. Experimental results on the evaluation dataset derived from two-speaker subsets of benchmark CALLHOME and CALLFRIEND…
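The cluster-based pipeline described in the abstract (embed short speech segments, cluster them by speaker, then merge neighbouring segments into turns) can be sketched roughly as follows. This is only an illustration under stated assumptions: the 2-D vectors are toy stand-ins for Whisper-derived speaker embeddings, plain k-means is swapped in for the paper's Mix-SAE clustering (whose details are not given on this page), and the names `kmeans`, `merge_turns`, and `seg_dur` are hypothetical.

```python
import math

def kmeans(points, k=2, iters=20):
    """Plain k-means with farthest-point initialisation.

    Stand-in for the clustering stage only; this is NOT Mix-SAE.
    """
    # Start from the first point, then repeatedly add the point
    # farthest from all centers chosen so far (deterministic init).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every segment embedding.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

def merge_turns(labels, seg_dur=1.0):
    """Merge consecutive same-speaker segments into (start, end, speaker) turns.

    Assumes fixed-length, non-overlapping segments of seg_dur seconds.
    """
    turns = []
    for i, lab in enumerate(labels):
        start, end = i * seg_dur, (i + 1) * seg_dur
        if turns and turns[-1][2] == lab:
            turns[-1] = (turns[-1][0], end, lab)  # extend the current turn
        else:
            turns.append((start, end, lab))
    return turns

# Toy "embeddings" for six consecutive one-second segments of a two-speaker call.
embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1],
              [5.1, 4.9], [0.05, 0.05], [4.9, 5.0]]
labels = kmeans(embeddings, k=2)
turns = merge_turns(labels)
print(turns)
```

Fixing `k=2` mirrors the two-speaker evaluation subsets mentioned in the abstract; in a real system the embeddings would come from the pre-trained model rather than being hand-written, and the clustering stage would be replaced by the method the paper proposes.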
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
