Generation of Speaker Representations Using Heterogeneous Training Batch Assembly

Yu-Huai Peng; Hung-Shin Lee; Pin-Tuan Huang; Hsin-Min Wang

arXiv:2203.16646 · cs.SD · April 1, 2022

TL;DR

This paper introduces a CNN-based speaker modeling approach that accounts for heterogeneity in training data, improving diarization performance by generating meaningful embeddings for multi-speaker segments.

Contribution

It proposes a novel training scheme with synthetic multi-speaker segments and soft labels, enhancing speaker representation quality for diarization tasks.

Findings

01. Outperforms baseline x-vector systems in diarization tasks

02. Achieves significant reductions in DER, JER, and WER on benchmark datasets

03. Demonstrates robustness with heterogeneous and overlapping speaker data

Abstract

In traditional speaker diarization systems, a well-trained speaker model is a key component to extract representations from consecutive and partially overlapping segments in a long speech session. To be more consistent with the back-end segmentation and clustering, we propose a new CNN-based speaker modeling scheme, which takes into account the heterogeneity of the speakers in each training segment and batch. We randomly and synthetically augment the training data into a set of segments, each of which contains more than one speaker and some overlapping parts. A soft label is imposed on each segment based on its speaker occupation ratio, and the standard cross entropy loss is implemented in model training. In this way, the speaker model should have the ability to generate a geometrically meaningful embedding for each multi-speaker segment. Experimental results show that our system is…
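
The soft-label idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `soft_label` and `soft_cross_entropy`, the interval format, and the choice to normalize by the total speaking time (so that overlapped speech counts toward every active speaker) are all assumptions made for illustration. The abstract only states that the label follows the speaker occupation ratio and that a standard cross-entropy loss is used.

```python
import numpy as np


def soft_label(intervals, num_speakers):
    """Build a soft label from per-speaker speaking time in one segment.

    intervals: list of (speaker_id, start_sec, end_sec) tuples (assumed format).
    Overlapped speech is credited to every speaker active in it (assumption).
    """
    occupied = np.zeros(num_speakers, dtype=np.float64)
    for spk, start, end in intervals:
        occupied[spk] += max(0.0, end - start)
    total = occupied.sum()
    if total == 0.0:
        raise ValueError("segment contains no speech")
    return occupied / total  # entries sum to 1


def soft_cross_entropy(logits, target):
    """Cross entropy between a soft target and softmax(logits)."""
    shifted = logits - logits.max()  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())


# Hypothetical 2-second segment from a 4-speaker training set:
# speaker 0 talks during 0.0-1.5 s, speaker 2 during 1.0-2.0 s (0.5 s overlap).
label = soft_label([(0, 0.0, 1.5), (2, 1.0, 2.0)], num_speakers=4)
# label is approximately [0.6, 0.0, 0.4, 0.0]

# Hypothetical classifier outputs for the same segment.
loss = soft_cross_entropy(np.array([2.0, 0.1, 1.0, -1.0]), label)
```

With a one-hot target, `soft_cross_entropy` reduces to the usual single-speaker cross entropy, so single-speaker and multi-speaker segments could share one loss under this sketch.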

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing