Language Modelling for Speaker Diarization in Telephonic Interviews
Miquel India, Javier Hernando, José A. R. Fonollosa

TL;DR
This paper combines language and acoustic features in an iterative LSTM-based system for speaker diarization, reporting an 84.29% improvement in word-level DER over a baseline on telephone interview data.
Contribution
It introduces an approach that fuses linguistic and acoustic features in an iterative LSTM-based speaker classifier for diarization.
Findings
84.29% reduction in word-level DER compared to baseline
Linguistic content enhances speaker recognition accuracy
Fusion of features outperforms acoustic-only systems
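To make the word-level DER figure concrete, here is a minimal sketch of how such a metric and a relative reduction could be computed. It assumes word-level DER means the fraction of words assigned to the wrong speaker after choosing the best mapping between hypothesis and reference labels; the paper's exact definition is not given on this page, and the function names `word_level_der` and `relative_reduction` are hypothetical.

```python
from itertools import permutations


def word_level_der(reference, hypothesis):
    """Fraction of words with the wrong speaker label under the best label mapping.

    Assumed definition for illustration only; not taken from the paper.
    """
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    ref_labels = sorted(set(reference))
    hyp_labels = sorted(set(hypothesis))
    # Pad with None so every hypothesis label can be mapped to something,
    # even when the hypothesis uses more speakers than the reference.
    targets = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))
    best_errors = len(reference)
    for perm in permutations(targets, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        errors = sum(mapping[h] != r for r, h in zip(reference, hypothesis))
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)


def relative_reduction(baseline_der, system_der):
    """Relative DER reduction in percent with respect to a baseline."""
    return 100.0 * (baseline_der - system_der) / baseline_der


# Toy labels, one per word; these are not data from the paper.
reference = ["A", "A", "B", "B", "A"]
hypothesis = ["1", "1", "0", "0", "0"]
der = word_level_der(reference, hypothesis)
```

The brute-force search over label mappings is only practical for a handful of speakers, which fits the two-party interview setting described here.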
Abstract
The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic ones. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database, which is composed of telephone interview audio recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in…
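The iterative idea in the abstract, where acoustic scores are rebuilt from the previous iteration's labels, can be illustrated with a heavily simplified sketch. This is not the authors' system. It assumes two speakers, one scalar acoustic feature per word, and a single Gaussian per speaker in place of a GMM. A fixed per-word linguistic score combined through a weighted sum stands in for the LSTM fed with character-level word embeddings. All names (`iterate_labels`, `weight`, `linguistic_scores`) are hypothetical.

```python
import math


def fit_gaussian(values):
    """Mean and variance of a 1-D sample (variance floored for stability)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-3)


def log_pdf(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def iterate_labels(acoustic, linguistic_scores, initial_labels,
                   weight=0.5, max_iters=10):
    """Refine 0/1 speaker labels per word until they stop changing.

    linguistic_scores: positive values favour speaker 1 (stand-in for a
    classifier output; an assumption, not the paper's LSTM).
    """
    labels = list(initial_labels)
    for _ in range(max_iters):
        # Re-estimate one acoustic model per speaker from the current labels.
        groups = {s: [x for x, l in zip(acoustic, labels) if l == s]
                  for s in (0, 1)}
        if not groups[0] or not groups[1]:
            break  # one speaker has no words left; stop refining
        models = {s: fit_gaussian(groups[s]) for s in (0, 1)}
        new_labels = []
        for x, ling in zip(acoustic, linguistic_scores):
            # Acoustic log-likelihood ratio between the two speaker models.
            llr = log_pdf(x, *models[1]) - log_pdf(x, *models[0])
            fused = weight * llr + (1 - weight) * ling
            new_labels.append(1 if fused > 0 else 0)
        if new_labels == labels:
            break  # converged
        labels = new_labels
    return labels


# Toy data, not from the paper: two well-separated speakers, noisy start.
acoustic = [0.1, 0.2, 0.0, 5.0, 5.2, 4.9]
linguistic_scores = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
refined = iterate_labels(acoustic, linguistic_scores, [0, 0, 1, 1, 1, 0])
```

In the system the abstract describes, the acoustic score is an input to the network rather than a term in a weighted sum, so the sketch shows only the relabel-and-refit loop.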
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
