Mixture of Speaker-type PLDAs for Children's Speech Diarization

Jiamin Xie; Suzanna Sia; Paola Garcia; Daniel Povey; Sanjeev Khudanpur

arXiv:2008.13213 · eess.AS · September 1, 2020 · 1 cites

Open Access

TL;DR

This paper proposes a speaker-type informed mixture of PLDA models for children's speech diarization, demonstrating improved performance by explicitly modeling speaker categories and using vocalization augmentation.

Contribution

It introduces a novel mixture of PLDA models based on speaker type, with a focus on children's speech, and shows performance gains using vocalization augmentation and balanced training data.

Findings

01

The mixture of speaker-type PLDAs reduces DER by 1.3% over a single PLDA model.

02

Vocalization augmentation yields an additional 0.9% DER reduction.

03

A balanced training dataset is crucial for optimal mixture model performance.
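
For readers unfamiliar with the metric, the following is a minimal sketch of how diarization error rate (DER) is conventionally computed: the summed duration of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech duration. The function name and the example durations are hypothetical and are not taken from the paper, and the summary above does not say whether the reported reductions are absolute or relative.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a percentage.

    All arguments are durations in seconds (hypothetical inputs):
      missed       - reference speech that the system labeled as non-speech
      false_alarm  - non-speech that the system labeled as speech
      confusion    - speech attributed to the wrong speaker
      total_speech - total duration of reference speech
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Illustrative numbers only: 6 s of errors over 60 s of reference speech.
example_der = diarization_error_rate(2.0, 1.0, 3.0, 60.0)
```

Because the denominator is reference speech only, DER can exceed 100% when a system produces a large amount of false-alarm speech.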

Abstract

In diarization, the PLDA is typically used to model an inference structure which assumes that the variation in speech segments is induced by various speakers. The speaker variation is then learned from the training data. However, human perception can differentiate speakers by age and gender, among other characteristics. In this paper, we investigate a speaker-type informed model that explicitly captures the known variation of speakers. We explore a mixture of three PLDA models, where each model represents an adult female, adult male, or child category. The weighting of each model is decided by the prior probability of its respective class, which we study. The evaluation is performed on a subset of the BabyTrain corpus. We examine the expected performance gain using the oracle speaker type labels, which yields an 11.7% DER reduction. We introduce a novel baby vocalization augmentation technique and…
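
The abstract describes weighting three speaker-type PLDA models by the prior probability of each class. As a rough illustration, the sketch below shows one plausible way such a combination could be computed: treat each model's output as a log-likelihood-ratio score and mix the scores in the probability domain with the class priors. This combination rule, the function name `mixture_score`, and the example numbers are assumptions made for illustration; the truncated abstract does not give the paper's exact formulation.

```python
import math


def mixture_score(type_scores, priors):
    """Combine per-type PLDA scores into a single mixture score.

    type_scores: dict mapping a speaker type (e.g. "female", "male", "child")
                 to a log-domain score from that type's PLDA model.
    priors:      dict mapping the same types to non-negative prior weights.

    Returns log(sum_k pi_k * exp(s_k)), with priors normalized to sum to 1.
    This rule is an illustrative assumption, not the paper's stated method.
    """
    if set(type_scores) != set(priors):
        raise ValueError("type_scores and priors must have the same keys")
    total = sum(priors.values())
    if total <= 0:
        raise ValueError("priors must sum to a positive value")

    # Work in the log domain; skip types whose prior is zero.
    terms = [
        math.log(priors[k] / total) + type_scores[k]
        for k in type_scores
        if priors[k] > 0
    ]
    # Log-sum-exp with the maximum factored out for numerical stability.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical scores for one pair of speech segments.
scores = {"female": 1.2, "male": -0.4, "child": 3.1}
uniform = {"female": 1 / 3, "male": 1 / 3, "child": 1 / 3}
combined = mixture_score(scores, uniform)
```

Placing all prior mass on one type reduces the result to that type's score, which corresponds to the oracle speaker-type setting mentioned in the abstract.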

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems