Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

Themos Stafylakis; Ladislav Mošner; Oldřich Plchot; Johan Rohdin; Anna Silnova; Lukáš Burget; Jan "Honza" Černocký

arXiv:2203.15436 · eess.AS · August 10, 2022 · Interspeech

TL;DR

This paper presents a novel method for training speaker embedding extractors using weakly annotated multi-speaker audio data, leveraging diarization and aggregation techniques to handle unknown speaker boundaries.

Contribution

It combines a baseline diarization algorithm, a modified loss that aggregates over segments, and two-stage training to train speaker embedding extractors without precise segment annotations.

Findings

01 Achieved competitive speaker embedding performance with weakly labeled data.

02 Analyzed the impact of different aggregation functions on training dynamics.

03 Demonstrated the effectiveness of the proposed method on VoxCeleb recordings.

Abstract

In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we use the full VoxCeleb recordings and the names of the celebrities appearing in each video, without knowledge of the time intervals in which the celebrities appear. We show that by combining a baseline speaker diarization algorithm that requires no training or parameter tuning, a modified loss with aggregation over segments, and a two-stage training approach, we are able to train a competitive ResNet-based embedding extractor. Finally, we experiment with two different aggregation functions and analyze their behaviour in terms of their gradients.
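
To make "a modified loss with aggregation over segments" more concrete, here is a minimal sketch of the general idea, not the authors' implementation. It assumes each recording has already been split into segments by a diarizer, that a classifier produces one logit per (segment, speaker) pair, and that only a recording-level speaker label is available. The abstract does not name the two aggregation functions, so max and log-sum-exp are used here purely as illustrative assumptions. All function names are hypothetical.

```python
import numpy as np

def aggregate_max(segment_logits):
    """Max over segments: (num_segments, num_speakers) -> (num_speakers,)."""
    return segment_logits.max(axis=0)

def aggregate_lse(segment_logits):
    """Log-sum-exp over segments, a smooth upper bound on the max."""
    m = segment_logits.max(axis=0)
    return m + np.log(np.exp(segment_logits - m).sum(axis=0))

def grad_weights_max(segment_logits):
    """Derivative of the max aggregate w.r.t. each segment logit.

    One-hot per speaker column: only the arg-max segment receives gradient.
    """
    w = np.zeros_like(segment_logits)
    cols = np.arange(segment_logits.shape[1])
    w[segment_logits.argmax(axis=0), cols] = 1.0
    return w

def grad_weights_lse(segment_logits):
    """Derivative of the log-sum-exp aggregate w.r.t. each segment logit.

    A softmax over segments: every segment receives some gradient.
    """
    e = np.exp(segment_logits - segment_logits.max(axis=0))
    return e / e.sum(axis=0)

def cross_entropy(recording_logits, label):
    """Softmax cross-entropy on the aggregated, recording-level logits."""
    z = recording_logits - recording_logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

# Toy example: 5 diarized segments, 4 training speakers, weak label = speaker 2.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))
label = 2
for name, agg in [("max", aggregate_max), ("lse", aggregate_lse)]:
    print(name, cross_entropy(agg(logits), label))
```

Under these assumptions, the difference between the two aggregators shows up in how the loss gradient is distributed over segments: the max routes all of it to a single segment per speaker, while log-sum-exp spreads it across segments in proportion to a softmax of their logits.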

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing