Uncertainty Quantification in Machine Learning for Joint Speaker Diarization and Identification

Simon W. McKnight; Aidan O. T. Hogg; Vincent W. Neo; Patrick A. Naylor

arXiv:2312.16763 · eess.AS · January 2, 2024 · 1 citation

Open Access

TL;DR

This paper explores uncertainty quantification in machine learning models for joint speaker diarization and identification, demonstrating how aleatoric and epistemic uncertainties can improve model reliability and performance, especially with model ensembles.

Contribution

It introduces a comprehensive analysis of uncertainty types in JSID models using CNNs and LSTMs, and proposes methods to leverage these uncertainties for enhanced speaker diarization accuracy.

Findings

01

Models on both $Φ$ and $Ψ$ have better diarization error rates (DERs) than models on either alone.

03

Model ensembles with Kalman filter smoothing outperform individual models in overlapping speaker scenarios.
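
The summary does not say how the ensemble outputs are combined or how the Kalman filter is configured. As a rough illustration only, the sketch below averages frame-level speaker-activity probabilities across models and passes them through a scalar, forward-only Kalman filter with a random-walk state model. The function names, the noise variances, and the sample probabilities are all assumptions made for this example, not values from the paper.

```python
from statistics import fmean


def ensemble_mean(per_model_probs):
    """Average frame-level probabilities across ensemble members.

    per_model_probs: list of equal-length lists, one list per model.
    """
    return [fmean(frame) for frame in zip(*per_model_probs)]


def kalman_filter(observations, process_var=1e-3, obs_var=5e-2):
    """Scalar Kalman filter with a random-walk state model (illustrative).

    process_var and obs_var are hypothetical tuning values.
    """
    estimate, variance = observations[0], 1.0
    filtered = []
    for z in observations:
        # Predict: the state is assumed constant, so only uncertainty grows.
        variance += process_var
        # Update: blend prediction and observation using the Kalman gain.
        gain = variance / (variance + obs_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        filtered.append(estimate)
    return filtered


# Hypothetical activity probabilities for one speaker:
# three models, six frames, with a spurious dip at the third frame.
probs = [
    [0.9, 0.8, 0.2, 0.9, 0.85, 0.9],
    [0.8, 0.9, 0.3, 0.8, 0.90, 0.8],
    [0.9, 0.7, 0.1, 0.9, 0.80, 0.9],
]
averaged = ensemble_mean(probs)
filtered = kalman_filter(averaged)
```

In this toy input the filter attenuates the single-frame dip rather than following it, which is the kind of temporal smoothing that can help when frame-level decisions flicker. A full smoother would add a backward pass, which is omitted here.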

Abstract

This paper studies modulation spectrum features ($Φ$) and mel-frequency cepstral coefficients ($Ψ$) in joint speaker diarization and identification (JSID). JSID matters because speaker diarization alone, which only distinguishes speakers, is insufficient for many applications; it is often necessary to identify the speakers as well. Machine learning models are set up using convolutional neural networks (CNNs) on $Φ$ and recurrent neural networks, specifically long short-term memory networks (LSTMs), on $Ψ$, whose outputs are then concatenated into fully connected layers. Experiment 1 shows that models on both $Φ$ and $Ψ$ have better diarization error rates (DERs) than models on either alone: a CNN on $Φ$ has a DER of 29.09%, compared to 27.78% for an LSTM on $Ψ$ and 19.44% for a model on both. Experiment 1 also investigates aleatoric uncertainties and shows the model on both $Φ$ and $Ψ$ has mean…
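
The abstract is cut off before it defines its uncertainty measures, so the following is not the authors' method. It is a minimal sketch of one common entropy-based way to split an ensemble's predictive uncertainty into an aleatoric part (average entropy of each member's prediction) and an epistemic part (the remaining disagreement between members). The function names and the two-class toy inputs are assumptions made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def decompose_uncertainty(member_probs):
    """Split ensemble predictive entropy into two parts.

    member_probs: one class-probability list per ensemble member.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged prediction,
      aleatoric = mean entropy of the individual predictions,
      epistemic = total - aleatoric (disagreement between members).
    """
    n = len(member_probs)
    mean_probs = [sum(col) / n for col in zip(*member_probs)]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(p) for p in member_probs) / n
    return total, aleatoric, total - aleatoric


# Hypothetical two-speaker frame scored by two ensemble members.
agree = [[0.5, 0.5], [0.5, 0.5]]       # both members are unsure
disagree = [[1.0, 0.0], [0.0, 1.0]]    # members are confident but conflict

agree_result = decompose_uncertainty(agree)
disagree_result = decompose_uncertainty(disagree)
```

Both toy cases have the same total entropy, but it is attributed entirely to the aleatoric term when the members agree on an uncertain answer and entirely to the epistemic term when confident members contradict each other.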

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Dropout · Monte Carlo Dropout