Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs
Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, and Haizhou Li

TL;DR
This paper introduces a multi-modal contrastive learning approach for speaker recognition that leverages speech and face data to create diverse positive pairs, significantly improving self-supervised speaker encoder performance.
Contribution
It proposes a novel multi-modal sampling strategy for contrastive learning, enhancing the robustness of self-supervised speaker encoders without using speaker labels.
Findings
Achieved state-of-the-art EER on VoxCeleb1 with self-supervised training.
Outperformed existing self-supervised methods by a large margin.
Demonstrated robustness across multiple datasets.
Abstract
We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of various length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common that we sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack necessary diversity for the training of a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the…
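The same-utterance sampling that the abstract calls PPP, together with a generic contrastive objective, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the InfoNCE form of the loss, the temperature value, and the helper names sample_ppp and info_nce are assumptions introduced here, and raw waveforms stand in for what a real speaker encoder would consume.

```python
import numpy as np

def sample_ppp(utterance, seg_len, rng):
    """Draw two segments from the same utterance (a 'poor-man's positive pair').

    Hypothetical helper: the two segments may overlap, which is part of why
    such pairs lack diversity.
    """
    max_start = len(utterance) - seg_len
    s1, s2 = rng.integers(0, max_start + 1, size=2)
    return utterance[s1:s1 + seg_len], utterance[s2:s2 + seg_len]

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (assumed here, not taken from the paper).

    Row i of `anchors` is paired with row i of `positives`; every other row
    in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)           # stand-in for 1 s of 16 kHz audio
seg_a, seg_b = sample_ppp(utterance, 4000, rng)

emb = rng.standard_normal((8, 16))               # stand-in for encoder outputs
loss_matched = info_nce(emb, emb)                # positives agree with anchors
loss_mismatched = info_nce(emb, emb[::-1])       # positives paired with wrong rows
```

Under this sketch, a diverse-positive-pair strategy would only change how the pairs are formed (for example, pairing segments from different utterances judged to share a speaker); the loss itself would stay the same.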
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Test · Contrastive Learning
