Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders

Phat Lam; Lam Pham; Truong Nguyen; Dat Ngo; Thinh Pham; Tin Nguyen; Loi Khanh Nguyen; Alexander Schindler

arXiv:2407.01963·eess.AS·September 13, 2024

Open Access

TL;DR

This paper introduces an unsupervised, multilingual speaker diarization system for telephone calls that leverages the pre-trained Whisper model for embeddings and a novel Mixture of Sparse Autoencoders for clustering, eliminating the need for large annotated datasets.

Contribution

The paper presents a new cluster-based diarization system that supports multiple languages and uses unsupervised learning with a novel autoencoder architecture, advancing multilingual and data-efficient speaker diarization.

Findings

01 Mix-SAE outperforms other autoencoder-based clustering methods.

02 The system achieves promising results on the CALLHOME and CALLFRIEND datasets.

03 The system supports integration into multi-task speech analysis applications.

Abstract

Existing speaker diarization systems typically rely on large amounts of manually annotated data, which is labor-intensive and difficult to obtain, especially in real-world scenarios. Additionally, language-specific constraints in these systems significantly hinder their effectiveness and scalability in multilingual settings. In this paper, we propose a cluster-based speaker diarization system designed for multilingual telephone call applications. Our proposed system supports multiple languages and eliminates the need for large-scale annotated data during training by utilizing the multilingual Whisper model to extract speaker embeddings. Additionally, we introduce a network architecture called Mixture of Sparse Autoencoders (Mix-SAE) for unsupervised speaker clustering. Experimental results on the evaluation dataset derived from two-speaker subsets of benchmark CALLHOME and CALLFRIEND…
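The cluster-based flow described in the abstract (per-segment speaker embeddings, then unsupervised clustering into speakers) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the authors' method: the embeddings are synthetic vectors rather than Whisper outputs, plain two-cluster k-means stands in for Mix-SAE, and the names `kmeans_two_speakers` and `labels_to_turns`, the 0.5-second segment length, and the 8-dimensional embedding size are all hypothetical.

```python
import numpy as np


def kmeans_two_speakers(embeddings, n_iter=50):
    """Cluster segment embeddings into two groups with plain k-means.

    Stand-in for the paper's Mix-SAE clustering, NOT an implementation of it.
    """
    x = np.asarray(embeddings, dtype=float)
    # Deterministic init: centroid 0 is the first segment,
    # centroid 1 is the segment farthest from it.
    c0 = x[0]
    c1 = x[np.argmax(np.linalg.norm(x - c0, axis=1))]
    centroids = np.stack([c0, c1])
    for _ in range(n_iter):
        # Assign every segment to its nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.stack([
            x[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(2)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


def labels_to_turns(labels, seg_dur):
    """Merge consecutive same-speaker segments into (start, end, speaker) turns."""
    turns = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            turns.append((start * seg_dur, i * seg_dur, int(labels[start])))
            start = i
    return turns


# Synthetic "call": ten 0.5 s segments from two well-separated speakers.
# Real embeddings would come from a speech encoder; these are made up.
rng = np.random.default_rng(0)
true_speakers = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 1])
speaker_means = np.stack([np.zeros(8), np.full(8, 3.0)])
embeddings = speaker_means[true_speakers] + 0.1 * rng.standard_normal((10, 8))

labels = kmeans_two_speakers(embeddings)
turns = labels_to_turns(labels, seg_dur=0.5)
print(turns)
```

Because clustering is unsupervised, the speaker indices are arbitrary; here the initialisation happens to tie index 0 to whoever speaks in the first segment. Diarization error is normally scored after finding the best mapping between predicted and reference speakers.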

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing