State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data

Sara Barahona; Ladislav Mošner; Themos Stafylakis; Oldřich Plchot; Junyi Peng; Lukáš Burget; Jan Černocký

arXiv:2410.02364·eess.AS·December 1, 2025

TL;DR

This paper presents a method for training speaker embedding extractors using only the audio stream and the celebrity names of the source VoxCeleb videos, achieving state-of-the-art speaker verification results without speaker timestamps or video data.

Contribution

The authors introduce a weakly supervised training approach for speaker embeddings that eliminates the need for speaker timestamps and multimodal alignment, enabling large-scale data utilization.

Findings

01. Achieves state-of-the-art speaker verification performance.

02. Delivers results comparable to fully supervised training on VoxCeleb.

03. Enables training with large-scale weakly labeled speech data.

Abstract

In this paper, we refine and validate our method for training speaker embedding extractors using weak annotations. More specifically, we use only the audio stream of the source VoxCeleb videos and the names of the celebrities without knowing the time intervals in which they appear in the recording. We experiment with hyperparameters and embedding extractors based on ResNet and WavLM. We show that the method achieves state-of-the-art results in speaker verification, comparable with training the extractors in a standard supervised way on the VoxCeleb dataset. We also extend it by considering segments belonging to unknown speakers appearing alongside the celebrities, which are typically discarded. Removing the need for speaker timestamps and multimodal alignment, our method unlocks the use of large-scale weakly labeled speech data, enabling direct training of state-of-the-art embedding…
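For readers unfamiliar with how speaker embeddings are used in verification, a minimal sketch follows. It assumes cosine-similarity scoring, a common back-end that the abstract does not confirm the authors used. The function names, the toy three-dimensional vectors, and the threshold of 0.5 are all hypothetical; real embeddings have hundreds of dimensions and the threshold is tuned on held-out trials.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enroll, test, threshold=0.5):
    """Accept the trial if the score reaches the (hypothetical) threshold."""
    return cosine_score(enroll, test) >= threshold


# Toy embeddings standing in for extractor outputs (assumed values).
enroll = [0.9, 0.1, 0.3]
test_same = [0.8, 0.2, 0.25]   # points in nearly the same direction
test_diff = [-0.2, 0.9, -0.4]  # points in a different direction

print(same_speaker(enroll, test_same))
print(same_speaker(enroll, test_diff))
```

The score depends only on the angle between the two vectors, so it is insensitive to their overall scale.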

Taxonomy

Topics: Video Analysis and Summarization

Methods: Convolution · Average Pooling · Max Pooling · Kaiming Initialization · Global Average Pooling