Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries
Themos Stafylakis, Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Anna Silnova, Lukáš Burget, Jan "Honza" Černocký

TL;DR
This paper presents a novel method for training speaker embedding extractors using weakly annotated multi-speaker audio data, leveraging diarization and aggregation techniques to handle unknown speaker boundaries.
Contribution
It combines a baseline speaker diarization algorithm, a modified loss with aggregation over segments, and two-stage training to train a speaker embedding extractor without precise segment annotations.
Findings
Achieved competitive speaker embedding performance with weakly labeled data.
Analyzed the impact of different aggregation functions on training dynamics.
Demonstrated the effectiveness of the proposed method on VoxCeleb recordings.
Abstract
In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we use the full VoxCeleb recordings and the names of the celebrities appearing in each video, without knowledge of the time intervals in which the celebrities appear. We show that by combining a baseline speaker diarization algorithm that requires no training or parameter tuning, a modified loss with aggregation over segments, and a two-stage training approach, we are able to train a competitive ResNet-based embedding extractor. Finally, we experiment with two different aggregation functions and analyze their behaviour in terms of their gradients.
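To make the idea of a loss with aggregation over segments more concrete, here is a minimal sketch. It is not the authors' implementation. The abstract does not name the two aggregation functions or say at which level aggregation is applied, so the choice of max and log-sum-exp over per-segment logits is an assumption made purely for illustration, and every function name below is hypothetical.

```python
import numpy as np

def aggregate_max(seg_logits):
    # seg_logits: (num_segments, num_speakers); keep the best segment per speaker.
    return seg_logits.max(axis=0)

def aggregate_lse(seg_logits):
    # Numerically stable log-sum-exp over segments (a smooth upper bound on max).
    m = seg_logits.max(axis=0)
    return m + np.log(np.exp(seg_logits - m).sum(axis=0))

def cross_entropy(rec_logits, label):
    # Standard softmax cross-entropy on the recording-level logits.
    z = rec_logits - rec_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def segment_weights(seg_logits, label, kind):
    # Derivative of the aggregated logit of `label` w.r.t. each segment's logit:
    # this shows how the gradient is shared among segments.
    col = seg_logits[:, label]
    if kind == "max":
        w = np.zeros_like(col)
        w[col.argmax()] = 1.0  # only the top-scoring segment receives gradient
        return w
    e = np.exp(col - col.max())
    return e / e.sum()  # softmax over segments: every segment receives some gradient

# Assumed toy input: 5 diarized segments, 3 training speakers, weak label = speaker 1.
rng = np.random.default_rng(0)
seg_logits = rng.normal(size=(5, 3))
label = 1

loss_max = cross_entropy(aggregate_max(seg_logits), label)
loss_lse = cross_entropy(aggregate_lse(seg_logits), label)
w_max = segment_weights(seg_logits, label, "max")
w_lse = segment_weights(seg_logits, label, "lse")
```

Under these assumptions, max routes the whole gradient for the labelled speaker to a single segment, while log-sum-exp spreads it across segments in proportion to their scores. That is one way in which two aggregation functions can differ in their gradients.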
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
