A Comparison Study on Infant-Parent Voice Diarization

Junzhe Zhu; Mark Hasegawa-Johnson; Nancy McElwain

arXiv:2011.02698·eess.AS·November 6, 2020

TL;DR

This paper presents a framework for infant-voice diarization built from state-of-the-art diarization algorithms. By comparing alternative system components and pre-training with multiple-instance learning, it reaches a lower diarization error rate (DER) than the LENA software.

Contribution

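It introduces an infant-voice diarization system, analyzes the effect of swapping its components and loss function, and proposes a multiple-instance learning pre-training method that uses larger datasets with coarser segment boundary labels. The best configuration reduces DER from 55.4% (LENA software) to 43.8% on the test dataset.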

Findings

01

The best system achieved 43.8% DER on the test dataset, compared to 55.4% for the LENA software

02

Using convolutional features improves neural diarization performance

03

Pre-training with multiple-instance learning enhances results
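
The findings are reported as diarization error rate (DER). As a reminder of what that number measures, here is a minimal frame-level sketch. It assumes the reference and hypothesis are already aligned frame by frame and share one label set, so no speaker mapping is needed. The labels `SIL`, `CHN`, `FAN`, and `MAN` and the function `frame_der` are hypothetical names chosen for illustration, and the paper's actual scoring setup (collars, overlap handling) is not described on this page.

```python
from typing import Sequence

SILENCE = "SIL"  # hypothetical label for non-speech frames


def frame_der(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Simplified DER: (missed + false alarm + confusion) / reference speech frames."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != SILENCE:
            speech += 1
            if hyp == SILENCE:
                missed += 1        # speech labeled as silence
            elif hyp != ref:
                confusion += 1     # speech assigned to the wrong speaker class
        elif hyp != SILENCE:
            false_alarm += 1       # silence labeled as speech
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example: one missed frame, one confusion, one false alarm over 6 speech frames.
reference = ["SIL", "CHN", "CHN", "FAN", "FAN", "SIL", "MAN", "MAN"]
hypothesis = ["SIL", "CHN", "SIL", "FAN", "CHN", "FAN", "MAN", "MAN"]
der = frame_der(reference, hypothesis)
print(f"DER = {der:.1%}")
```

Because false alarms are counted against the amount of reference speech, DER can exceed 100%. On this scale, moving from 55.4% to 43.8% is an absolute reduction of 11.6 points, roughly 21% relative.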

Abstract

We design a framework for studying prelinguistic child voice from 3 to 24 months based on state-of-the-art algorithms in diarization. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing the loss function, to find the best performance. We also present a multiple-instance learning technique that allows us to pre-train our parameters on larger datasets with coarser segment boundary labels. We found that our best system achieved 43.8% DER on the test dataset, compared to 55.4% DER achieved by the LENA software. We also found that using a convolutional feature extractor instead of logmel features significantly increases the performance of neural diarization.
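
The abstract mentions pre-training on data whose segment boundaries are only coarsely labeled. One generic way to learn from such labels is max-pooling multiple-instance learning: treat each coarse segment as a "bag" of frames, call the bag positive if at least one frame is, and apply a binary cross-entropy loss to the pooled score. The sketch below illustrates that general idea only. The abstract does not say which pooling or loss the authors use, and `bag_probability` and `mil_bce_loss` are hypothetical names.

```python
import math
from typing import Sequence


def bag_probability(frame_probs: Sequence[float]) -> float:
    """Max pooling: the bag score is the score of its most confident frame."""
    return max(frame_probs)


def mil_bce_loss(frame_probs: Sequence[float], bag_label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy between the pooled bag score and the coarse bag label."""
    p = min(max(bag_probability(frame_probs), eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1.0 - p))


# Per-frame probabilities of "infant voice" inside two coarsely labeled segments.
positive_bag = [0.1, 0.2, 0.9, 0.3]    # segment labeled as containing infant voice
negative_bag = [0.1, 0.2, 0.1, 0.05]   # segment labeled as containing none

loss_pos = mil_bce_loss(positive_bag, 1)  # driven by the 0.9 frame only
loss_neg = mil_bce_loss(negative_bag, 0)  # driven by the 0.2 frame only
print(loss_pos, loss_neg)
```

With max pooling, only the highest-scoring frame in each bag contributes to the loss, so the model is never penalized for frames inside a positive segment that do not contain the target voice. That is what lets coarse boundaries serve as supervision.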

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Infant Health and Development