Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords
Harshavardhan Sundar, Thippur V. Sreenivas

TL;DR
This paper introduces a latent variable framework for jointly decoding multiple speakers and their keywords in a mixture signal, using the EM algorithm for parameter estimation and reporting 82% accuracy in complex scenarios.
Contribution
The paper proposes a latent variable model that jointly identifies the active speakers and the keywords each of them utters in a multi-speaker mixture signal.
Findings
Achieves 82% accuracy in detecting speakers and keywords in mixed signals.
Uses Student's-t mixture models for speaker-specific-keyword modeling.
Employs the EM algorithm for maximum likelihood estimation of the latent variable mass functions.
Abstract
In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a multi-speaker mixture signal. We introduce two separate LVs to denote active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific-keyword models. The proposed framework admits stochastic models, representing the probability density function of the observation vectors given that a particular speaker uttered a specific keyword, as speaker-specific-keyword models. The LV mass functions are estimated in a Maximum Likelihood framework using the Expectation Maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two…
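The estimation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the mixture density takes the form p(x) = sum over s, k of P(s) P(k|s) p(x | s, k), that the speaker-specific-keyword densities p(x | s, k) are already trained and held fixed, and that only the joint mass function P(s, k) = P(s) P(k|s) is re-estimated by EM. The names `em_joint_weights` and `toy_loglik` are hypothetical, and the toy data uses 1-D unit-variance Gaussians as a stand-in for the Student's-t mixture models mentioned above.

```python
import numpy as np

def em_joint_weights(loglik, n_iter=50):
    """EM for the joint mass function P(speaker, keyword), densities held fixed.

    loglik[t, s, k] is the log-density of frame t under the (assumed,
    pre-trained) model for speaker s uttering keyword k.
    Returns an (S, K) array that sums to 1.
    """
    T, S, K = loglik.shape
    joint = np.full((S, K), 1.0 / (S * K))  # uniform initialisation
    for _ in range(n_iter):
        # E-step: posterior of each (speaker, keyword) pair for every frame.
        log_post = loglik + np.log(joint + 1e-300)
        log_post -= log_post.max(axis=(1, 2), keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=(1, 2), keepdims=True)
        # M-step: the new mass function is the average posterior over frames.
        joint = post.mean(axis=0)
    return joint

def toy_loglik(seed=0, n_per_source=200):
    """Toy data: 2 speakers x 2 keywords; only pairs (0, 1) and (1, 0) are active."""
    rng = np.random.default_rng(seed)
    means = np.array([[-6.0, -2.0], [2.0, 6.0]])  # hypothetical model means
    x = np.concatenate([
        rng.normal(means[0, 1], 1.0, n_per_source),  # speaker 0, keyword 1
        rng.normal(means[1, 0], 1.0, n_per_source),  # speaker 1, keyword 0
    ])
    # Unit-variance Gaussian log-density of every frame under every (s, k) model.
    return -0.5 * (x[:, None, None] - means[None]) ** 2 - 0.5 * np.log(2 * np.pi)
```

Calling `em_joint_weights(toy_loglik())` returns a 2 x 2 array whose mass concentrates on the two active pairs. Under the assumed factorisation, the speaker marginal is `joint.sum(axis=1)` and the conditional keyword mass function is `joint / joint.sum(axis=1, keepdims=True)`; the largest entries of `joint` play the role of the detected speaker-keyword pairs.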
