Momentum Contrast Speaker Representation Learning

Jangho Lee; Jaihyun Koh; Sungroh Yoon

arXiv:2010.11457 · eess.AS · October 23, 2020 · 1 citation

Open Access

TL;DR

This paper introduces MoCoVox, an unsupervised contrastive learning method for speaker representation, pre-trained on VoxCeleb1 with instance discrimination. On speaker verification it outperforms the state-of-the-art metric learning-based approach by a large margin.

Contribution

The paper extends Momentum Contrast to the speech domain, proposing MoCoVox for unsupervised speaker representation learning and showing that it outperforms a metric learning-based approach on speaker verification.

Findings

01 MoCoVox outperforms state-of-the-art metric learning methods.

02 Contrastive learning effectively captures speaker features.

03 Unsupervised learning aids open-set speaker recognition.

Abstract

Unsupervised representation learning has shown remarkable achievement by reducing the performance gap with supervised feature learning, especially in the image domain. In this study, to extend the technique of unsupervised learning to the speech domain, we propose the Momentum Contrast for VoxCeleb (MoCoVox) as a form of learning mechanism. We pre-trained the MoCoVox on the VoxCeleb1 by implementing instance discrimination. Applying MoCoVox for speaker verification revealed that it outperforms the state-of-the-art metric learning-based approach by a large margin. We also empirically demonstrate the features of contrastive learning in the speech domain by analyzing the distribution of learned representations. Furthermore, we explored which pretext task is adequate for speaker verification. We expect that learning speaker representation without human supervision helps to address the…
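
The abstract names two mechanisms, momentum contrast and instance discrimination, without showing how they fit together. The sketch below illustrates the general Momentum Contrast recipe in plain Python: a key encoder updated as a moving average of the query encoder, a queue of negative keys, and an InfoNCE loss that treats the other view of the same utterance as the only positive. It is not the authors' implementation. The function names, the momentum value 0.999, the temperature 0.07, and the use of flat parameter lists are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder.

    Assumption: parameters are flat lists of floats; m=0.999 is illustrative.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def info_nce_loss(query, positive_key, queue, temperature=0.07):
    """Instance-discrimination loss for one query embedding.

    The positive key (another view of the same utterance) sits at index 0;
    every embedding in the queue is treated as a negative.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(neg)) / temperature for neg in queue]
    # Cross-entropy with target index 0, via a numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

def enqueue_dequeue(queue, new_keys, max_size):
    """Append the newest keys and keep only the most recent max_size entries."""
    return (queue + new_keys)[-max_size:]
```

In a training loop under these assumptions, each step would encode two augmented segments of one utterance, compute `info_nce_loss` on the pair against the queue, update the query encoder by gradient descent, call `momentum_update` for the key encoder, and then push the new keys with `enqueue_dequeue`.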

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Contrastive Learning