Generation of Speaker Representations Using Heterogeneous Training Batch Assembly
Yu-Huai Peng, Hung-Shin Lee, Pin-Tuan Huang, Hsin-Min Wang

TL;DR
This paper introduces a CNN-based speaker modeling approach that accounts for the heterogeneity of speakers within each training segment and batch, improving diarization performance by generating geometrically meaningful embeddings for multi-speaker segments.
Contribution
It proposes a novel training scheme with synthetic multi-speaker segments and soft labels, enhancing speaker representation quality for diarization tasks.
Findings
Outperforms baseline x-vector systems in diarization tasks
Achieves significant reductions in DER, JER, and WER on benchmark datasets
Demonstrates robustness with heterogeneous and overlapping speaker data
Abstract
In traditional speaker diarization systems, a well-trained speaker model is a key component to extract representations from consecutive and partially overlapping segments in a long speech session. To be more consistent with the back-end segmentation and clustering, we propose a new CNN-based speaker modeling scheme, which takes into account the heterogeneity of the speakers in each training segment and batch. We randomly and synthetically augment the training data into a set of segments, each of which contains more than one speaker and some overlapping parts. A soft label is imposed on each segment based on its speaker occupation ratio, and the standard cross entropy loss is implemented in model training. In this way, the speaker model should have the ability to generate a geometrically meaningful embedding for each multi-speaker segment. Experimental results show that our system is…
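The soft-label idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`soft_label`, `soft_cross_entropy`) are hypothetical, and the occupation ratio is assumed here to be each speaker's active duration divided by the summed active durations of all speakers in the segment, so overlapped time counts once for every speaker talking. The paper may define the ratio differently.

```python
import math


def soft_label(durations, num_speakers):
    """Build a soft target from per-speaker speech durations (seconds).

    `durations` maps a speaker index to that speaker's active time in the
    segment. ASSUMPTION: the ratio is duration / sum of all durations.
    """
    total = sum(durations.values())
    if total <= 0:
        raise ValueError("segment contains no speech")
    label = [0.0] * num_speakers
    for speaker, duration in durations.items():
        label[speaker] = duration / total
    return label


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


# A synthetic segment in which speaker 0 talks for 1.5 s and speaker 2 for 0.5 s,
# drawn from a hypothetical pool of 4 training speakers.
label = soft_label({0: 1.5, 2: 0.5}, num_speakers=4)
loss = soft_cross_entropy([2.0, 0.1, 0.7, -1.0], label)
```

With a one-hot target this reduces to the usual single-speaker cross-entropy, which is why the abstract can describe the loss as standard while still training on mixed segments.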
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
