State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data
Sara Barahona, Ladislav Mošner, Themos Stafylakis, Oldřich Plchot, Junyi Peng, Lukáš Burget, Jan Černocký

TL;DR
This paper presents a method for training speaker embeddings using only audio and celebrity names from VoxCeleb videos, achieving state-of-the-art results without needing speaker timestamps or video data.
Contribution
The authors introduce a weakly supervised training approach for speaker embeddings that eliminates the need for speaker timestamps and multimodal alignment, enabling large-scale data utilization.
Findings
Achieves state-of-the-art speaker verification performance.
Performs comparably to standard supervised training on VoxCeleb.
Enables training with large-scale weakly labeled speech data.
Abstract
In this paper, we refine and validate our method for training speaker embedding extractors using weak annotations. More specifically, we use only the audio stream of the source VoxCeleb videos and the names of the celebrities without knowing the time intervals in which they appear in the recording. We experiment with hyperparameters and embedding extractors based on ResNet and WavLM. We show that the method achieves state-of-the-art results in speaker verification, comparable with training the extractors in a standard supervised way on the VoxCeleb dataset. We also extend it by considering segments belonging to unknown speakers appearing alongside the celebrities, which are typically discarded. Removing the need for speaker timestamps and multimodal alignment, our method unlocks the use of large-scale weakly labeled speech data, enabling direct training of state-of-the-art embedding…
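To make the weak-annotation setting concrete, the sketch below shows one generic way a recording-level label (the celebrity's name) can supervise segment-level predictions when it is unknown which segments contain that speaker. It is a multiple-instance-style illustration only: the pooling rule, the function names `log_softmax` and `weak_label_loss`, and the toy shapes are assumptions made here, not the objective used in the paper, which this summary does not specify.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def weak_label_loss(segment_logits, celebrity_id):
    """Recording-level loss from segment-level speaker logits (hypothetical).

    segment_logits: array of shape (num_segments, num_speakers), one row per
        segment of a single recording.
    celebrity_id: index of the celebrity named for the recording. No
        timestamps are used, so any segment may or may not contain them.
    """
    log_post = log_softmax(segment_logits)[:, celebrity_id]
    # Assumed pooling: log of the mean segment posterior. One confident
    # segment is enough to make the recording-level score high.
    peak = log_post.max()
    pooled = peak + np.log(np.mean(np.exp(log_post - peak)))
    return -pooled


# Toy recording: 3 segments, 4 candidate speakers, celebrity is speaker 0.
with_celebrity = np.array([[10.0, 0.0, 0.0, 0.0],
                           [0.0, 0.0, 10.0, 0.0],
                           [0.0, 0.0, 10.0, 0.0]])
without_celebrity = np.tile([0.0, 0.0, 10.0, 0.0], (3, 1))

loss_present = weak_label_loss(with_celebrity, celebrity_id=0)
loss_absent = weak_label_loss(without_celebrity, celebrity_id=0)
```

In this toy case the recording with at least one celebrity-like segment receives the lower loss, which is the behavior a weakly supervised objective of this kind needs: segments from other speakers in the same recording are not forced toward the celebrity's label.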
Taxonomy
Topics: Video Analysis and Summarization
Methods: Convolution · Average Pooling · Max Pooling · Kaiming Initialization · Global Average Pooling
