Mixture of Speaker-type PLDAs for Children's Speech Diarization
Jiamin Xie, Suzanna Sia, Paola Garcia, Daniel Povey, Sanjeev Khudanpur

TL;DR
This paper proposes a speaker-type informed mixture of PLDA models for children's speech diarization, demonstrating improved performance by explicitly modeling speaker categories and using vocalization augmentation.
Contribution
It introduces a novel mixture of PLDA models based on speaker type, with a focus on children's speech, and shows performance gains using vocalization augmentation and balanced training data.
Findings
Mixture of speaker-type PLDA reduces DER by 1.3% over single PLDA.
Vocalization augmentation yields an additional 0.9% DER reduction.
A balanced training dataset is crucial for optimal mixture model performance.
Abstract
In diarization, PLDA is typically used to model an inference structure that assumes the variation in speech segments is induced by different speakers. The speaker variation is then learned from the training data. However, human perception can differentiate speakers by age and gender, among other characteristics. In this paper, we investigate a speaker-type informed model that explicitly captures the known variation of speakers. We explore a mixture of three PLDA models, where each model represents an adult female, adult male, or child category. The weighting of each model is decided by the prior probability of its respective class, which we study. The evaluation is performed on a subset of the BabyTrain corpus. We examine the expected performance gain using the oracle speaker type labels, which yields an 11.7% DER reduction. We introduce a novel baby vocalization augmentation technique and…
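The prior-weighted mixture described in the abstract can be illustrated with a minimal sketch. It assumes each speaker-type PLDA has already produced, for a pair of segments, a log-likelihood under the same-speaker hypothesis and one under the different-speaker hypothesis. The function name `mixture_llr`, its inputs, and the choice to mix likelihoods before taking the ratio are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def mixture_llr(loglik_same, loglik_diff, priors):
    """Combine per-type PLDA log-likelihoods into one mixture log-likelihood ratio.

    Hypothetical inputs (not taken from the paper):
      loglik_same[k]: log p_k(x1, x2 | same speaker) under the type-k PLDA
      loglik_diff[k]: log p_k(x1, x2 | different speakers) under the type-k PLDA
      priors[k]:      prior probability of speaker type k (positive, sums to 1)
    """
    loglik_same = np.asarray(loglik_same, dtype=float)
    loglik_diff = np.asarray(loglik_diff, dtype=float)
    priors = np.asarray(priors, dtype=float)

    if loglik_same.shape != priors.shape or loglik_diff.shape != priors.shape:
        raise ValueError("one log-likelihood and one prior per speaker type")
    if np.any(priors <= 0) or not np.isclose(priors.sum(), 1.0):
        raise ValueError("priors must be positive and sum to 1")

    log_priors = np.log(priors)
    # log sum_k pi_k * p_k(. | hypothesis), computed stably in the log domain
    log_num = np.logaddexp.reduce(log_priors + loglik_same)
    log_den = np.logaddexp.reduce(log_priors + loglik_diff)
    return float(log_num - log_den)


# Three speaker types (e.g. adult female, adult male, child) with made-up values.
score = mixture_llr(
    loglik_same=[-10.2, -11.5, -9.8],
    loglik_diff=[-12.0, -12.3, -12.1],
    priors=[0.3, 0.3, 0.4],
)
```

In this sketch, pushing one prior toward 1 makes the score approach that type's single-PLDA log-likelihood ratio, which is why the choice of priors matters.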
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
