Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

Ruijie Tao; Kong Aik Lee; Rohan Kumar Das; Ville Hautamäki; Haizhou Li

arXiv:2210.15385·eess.AS·October 28, 2022·1 cites

Open Access

TL;DR

This paper introduces a multi-modal contrastive learning approach for speaker recognition that leverages speech and face data to create diverse positive pairs, significantly improving self-supervised speaker encoder performance.

Contribution

It proposes a novel multi-modal sampling strategy for contrastive learning, enhancing the robustness of self-supervised speaker encoders without using speaker labels.

Findings

01. Achieved a state-of-the-art equal error rate (EER) on VoxCeleb1 with self-supervised training.

02. Outperformed existing self-supervised methods by a large margin.

03. Demonstrated robustness across multiple datasets.
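
For readers unfamiliar with the metric named in the findings, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from verification scores. It is not taken from the paper; the function name `equal_error_rate` and the toy scores are hypothetical, and the threshold sweep over observed scores is a simplifying assumption.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores, for illustration only.
separated = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.9, 0.4], [0.6, 0.1])
```

A lower EER indicates a better verification system; evaluation toolkits typically interpolate between thresholds rather than picking the nearest observed one as this sketch does.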

Abstract

We study a novel neural architecture and training strategies for a speaker encoder, targeting speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack the diversity necessary for training a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the…
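
The baseline sampling described in the abstract, where both segments of a positive pair come from the same utterance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sample_ppp`, `cosine`, and `info_nce` are hypothetical, the InfoNCE-style loss is one common choice for contrastive learning and is not confirmed by the text above, and plain Python lists stand in for waveforms and encoder embeddings.

```python
import math
import random


def sample_ppp(utterance, seg_len, rng):
    """Draw two random segments from the SAME utterance (a poor-man's positive pair)."""
    if len(utterance) < seg_len:
        raise ValueError("utterance shorter than segment length")
    max_start = len(utterance) - seg_len
    a = rng.randint(0, max_start)  # randint is inclusive on both ends
    b = rng.randint(0, max_start)
    return utterance[a:a + seg_len], utterance[b:b + seg_len]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (assumed loss form, not taken from the paper).

    anchors[i] should match positives[i]; every positives[j] with j != i
    acts as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Hypothetical toy data: a 100-sample "utterance" and 2-D "embeddings".
rng = random.Random(0)
seg_a, seg_b = sample_ppp(list(range(100)), 20, rng)
loss_matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because both segments share one recording, they also share channel and session conditions, which is the lack of diversity the abstract attributes to PPP. The diverse positive pairs (DPP) proposed in the paper would replace `sample_ppp` with pairs drawn from different utterances; that selection step is not sketched here.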


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Test · Contrastive Learning