Language Modelling for Speaker Diarization in Telephonic Interviews

Miquel India; Javier Hernando; José A.R. Fonollosa

arXiv:2501.17893 · eess.AS · January 31, 2025

Open Access

TL;DR

This paper explores combining linguistic and acoustic features in an iterative LSTM-based system for speaker diarization, reporting an 84.29% improvement in word-level DER over the baseline on telephone interview data.

Contribution

The paper introduces an approach that fuses linguistic features (character-level word embeddings) with a GMM-based acoustic score in an iterative LSTM-based classifier for speaker diarization.

Findings

01

84.29% reduction in word-level DER compared to baseline

02

Linguistic content carries discriminative speaker information that improves diarization accuracy

03

Fusion of features outperforms acoustic-only systems

Abstract

The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, which can be even more reliable than the acoustic features. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database, which is composed of telephone interview recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in…
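
The iterative idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes two speakers, replaces the LSTM with a fixed linear fusion of two scores, replaces the GMM with a single 1-D Gaussian per speaker, and takes a precomputed per-word linguistic score in place of character-level word embeddings. All names (`acoustic_scores`, `diarize`, `weight`) and the data are hypothetical.

```python
import math
from statistics import mean, pvariance


def gaussian_loglik(x, mu, var):
    """Log-density of a 1-D Gaussian (a one-component stand-in for a GMM)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def acoustic_scores(feats, labels):
    """Per-word log-likelihood ratio (speaker 1 vs. speaker 0), using
    speaker models fitted on the labels from the previous iteration."""
    params = []
    for spk in (0, 1):
        xs = [x for x, y in zip(feats, labels) if y == spk]
        if len(xs) < 2:
            return [0.0] * len(feats)  # too little data: no acoustic evidence
        params.append((mean(xs), max(pvariance(xs), 1e-3)))  # floor the variance
    return [
        gaussian_loglik(x, *params[1]) - gaussian_loglik(x, *params[0])
        for x in feats
    ]


def diarize(feats, ling_scores, weight=1.0, max_iters=10):
    """Relabel words iteratively until the labels stop changing."""
    # Iteration 0 uses linguistic evidence only (positive score -> speaker 1).
    labels = [1 if s > 0 else 0 for s in ling_scores]
    for _ in range(max_iters):
        ac = acoustic_scores(feats, labels)
        # Assumption: a fixed linear fusion stands in for the LSTM classifier.
        new_labels = [
            1 if s + weight * a > 0 else 0 for s, a in zip(ling_scores, ac)
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical data: one acoustic feature and one linguistic score per word.
feats = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
ling_scores = [-2.0, -1.5, 0.3, -1.0, 2.0, 1.5, -0.3, 1.0]
labels = diarize(feats, ling_scores)
print(labels)  # prints [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy data the linguistic score alone mislabels the third and seventh words; the acoustic score built from the first-pass labels corrects both, and a second pass confirms that the labels are stable.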

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory