Speaker diarization using latent space clustering in generative adversarial network
Monisankha Pal, Manoj Kumar, Raghuveer Peri, Tae Jin Park, So Hyun Kim, Catherine Lord, Somer Bishop, Shrikanth Narayanan

TL;DR
This paper introduces a novel deep latent space clustering approach for speaker diarization using GANs, which significantly outperforms existing x-vector based systems across multiple datasets, including clinical and meeting scenarios.
Contribution
The work presents a joint training framework combining GANs, latent variable recovery, and clustering losses for improved speaker diarization, especially in challenging clinical data.
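One plausible way to read "joint training" is as a weighted sum of the three named losses. The form below is only a sketch under that assumption: the weights $\beta_1$ and $\beta_2$ and the symbol names are hypothetical, and the paper's exact formulation may differ.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{GAN}}
  + \beta_1 \, \mathcal{L}_{\text{recovery}}
  + \beta_2 \, \mathcal{L}_{\text{cluster}},
\qquad \beta_1, \beta_2 \ge 0 \ \text{(assumed weights)}
```

Here $\mathcal{L}_{\text{recovery}}$ stands for the latent variable recovery loss computed through the encoder, and $\mathcal{L}_{\text{cluster}}$ for the clustering-specific loss.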
Findings
Significant diarization error rate (DER) reduction on the AMI, ADOS, and BOSCC datasets.
Embedding fusion with x-vectors improves diarization accuracy.
Outperforms state-of-the-art x-vector based systems.
Abstract
In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus, and two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions respectively obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly…
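The latent sampling described in the abstract (a continuous random part combined with a one-hot part derived from speaker labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_latent`, the Gaussian choice for the continuous part, and the values of `cont_dim` and `sigma` are all assumptions.

```python
import numpy as np

def sample_latent(speaker_labels, num_speakers, cont_dim=30, sigma=0.1, seed=0):
    """Build latent vectors z = [z_cont, z_disc] (hypothetical helper).

    z_cont: continuous part, drawn here from N(0, sigma^2 I) (assumed distribution).
    z_disc: one-hot encoding of each sample's speaker label.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(speaker_labels)
    # Continuous random component, one row per sample.
    z_cont = rng.normal(0.0, sigma, size=(labels.shape[0], cont_dim))
    # Discrete component: pick rows of the identity matrix by label index.
    z_disc = np.eye(num_speakers)[labels]
    return np.concatenate([z_cont, z_disc], axis=1)

# Four segments spoken by speakers 0, 2, 1, 2 in a three-speaker session.
z = sample_latent([0, 2, 1, 2], num_speakers=3)
print(z.shape)
```

In a setup like this, the one-hot block gives each speaker its own region of the latent space, which is what makes clustering in that space meaningful.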
