The Phonexia VoxCeleb Speaker Recognition Challenge 2021 System Description
Josef Slavíček, Albert Swart, Michal Klčo, Niko Brümmer

TL;DR
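This paper describes Phonexia's submission to the unsupervised speaker verification track of the VoxCeleb Speaker Recognition Challenge 2021: an embedding extractor bootstrapped with momentum contrastive learning, iterative clustering to assign pseudo-speaker labels, and fusion of zt-normalized cosine scores from five embedding extractors.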
Contribution
It applies an unsupervised speaker verification pipeline closely modeled on IDLab's winning VoxSRC-20 submission, combining contrastive learning, iterative clustering, and score fusion, and it briefly reports unsuccessful attempts with i-vectors and PLDA scoring.
Findings
Effective unsupervised embedding extraction via contrastive learning
Iterative clustering improves pseudo-label quality
Score fusion enhances verification performance
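The contrastive bootstrap named in the first finding can be illustrated with a minimal NumPy sketch of an InfoNCE-style loss and a momentum update of the key encoder, in the spirit of momentum contrastive learning. This is not the authors' code: the function names, the temperature of 0.07, the momentum of 0.999, and the use of a queue of negatives are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce_loss(queries, keys, queue, temperature=0.07):
    """InfoNCE loss for a batch (hypothetical helper).

    queries: (B, D) embeddings of one augmented view of each utterance.
    keys:    (B, D) embeddings of a second augmented view (the positives).
    queue:   (K, D) embeddings of other utterances (the negatives).
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    neg = l2_normalize(queue)
    l_pos = np.sum(q * k, axis=1, keepdims=True)   # (B, 1) positive logits
    l_neg = q @ neg.T                              # (B, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the positive always at index 0.
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())


def momentum_update(key_params, query_params, momentum=0.999):
    # The key encoder slowly tracks the query encoder (exponential moving average).
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]
```

The only supervision here is that two augmented views of the same utterance should score higher than views of other utterances, which matches the abstract's description of augmentations as the sole source of supervision.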
Abstract
We describe the Phonexia submission for the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) in the unsupervised speaker verification track. Our solution was very similar to IDLab's winning submission for VoxSRC-20. An embedding extractor was bootstrapped using momentum contrastive learning, with input augmentations as the only source of supervision. This was followed by several iterations of clustering to assign pseudo-speaker labels that were then used for supervised embedding extractor training. Finally, score fusion was performed by averaging the zt-normalized cosine scores of five different embedding extractors. We also briefly describe unsuccessful solutions involving i-vectors instead of DNN embeddings and PLDA instead of cosine scoring.
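The final fusion step can be sketched as follows. This is a minimal NumPy illustration of averaging zt-normalized cosine scores across several embedding extractors, not the authors' implementation. It assumes one common form of zt-norm (z-norm followed by t-norm computed on z-normalized cohort scores); the cohort sets, function names, and array layout are hypothetical.

```python
import numpy as np


def cosine_scores(a, b):
    # Pairwise cosine similarity between rows of a (N, D) and rows of b (M, D).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def zt_norm(enroll, test, z_cohort, t_cohort):
    """ZT-normalized cosine scores, shape (num_enroll, num_test).

    z_cohort: embeddings scored against enrollment embeddings (z-norm).
    t_cohort: embeddings scored against test embeddings (t-norm).
    """
    raw = cosine_scores(enroll, test)

    # Z-norm: normalize each enrollment row by its scores against z_cohort.
    enroll_vs_z = cosine_scores(enroll, z_cohort)
    z_scores = (raw - enroll_vs_z.mean(axis=1, keepdims=True)) \
        / enroll_vs_z.std(axis=1, keepdims=True)

    # T-norm on top: score t_cohort against the test side, z-normalize those
    # scores the same way, then normalize each test column by their statistics.
    cohort_vs_test = cosine_scores(t_cohort, test)
    cohort_vs_z = cosine_scores(t_cohort, z_cohort)
    cohort_z = (cohort_vs_test - cohort_vs_z.mean(axis=1, keepdims=True)) \
        / cohort_vs_z.std(axis=1, keepdims=True)
    return (z_scores - cohort_z.mean(axis=0, keepdims=True)) \
        / cohort_z.std(axis=0, keepdims=True)


def fuse_systems(systems):
    # Each item is (enroll, test, z_cohort, t_cohort) from one extractor.
    # Fusion is a plain, equal-weight average of the normalized score matrices.
    return np.mean([zt_norm(*s) for s in systems], axis=0)
```

Normalizing before averaging puts each extractor's scores on a comparable scale, so an equal-weight mean does not let one system's score range dominate the fused result.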
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
