Momentum Contrast Speaker Representation Learning
Jangho Lee, Jaihyun Koh, Sungroh Yoon

TL;DR
This paper introduces MoCoVox, an unsupervised contrastive learning method for speaker representation that outperforms existing approaches in speaker verification, demonstrating the effectiveness of contrastive learning in speech.
Contribution
It extends Momentum Contrast (MoCo) to the speech domain, proposing MoCoVox for unsupervised speaker representation learning and showing that it outperforms a metric learning-based approach on speaker verification.
Findings
On speaker verification, MoCoVox outperforms the state-of-the-art metric learning-based approach by a large margin.
Contrastive learning effectively captures speaker features.
Unsupervised learning aids open-set speaker recognition.
Abstract
Unsupervised representation learning has shown remarkable achievement by reducing the performance gap with supervised feature learning, especially in the image domain. In this study, to extend the technique of unsupervised learning to the speech domain, we propose the Momentum Contrast for VoxCeleb (MoCoVox) as a form of learning mechanism. We pre-trained the MoCoVox on the VoxCeleb1 by implementing instance discrimination. Applying MoCoVox for speaker verification revealed that it outperforms the state-of-the-art metric learning-based approach by a large margin. We also empirically demonstrate the features of contrastive learning in the speech domain by analyzing the distribution of learned representations. Furthermore, we explored which pretext task is adequate for speaker verification. We expect that learning speaker representation without human supervision helps to address the…
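The abstract names two ingredients, momentum contrast and instance discrimination, without giving their details. As a rough illustration only, the NumPy sketch below shows how a generic MoCo-style objective is usually put together: an InfoNCE loss over one positive key and a queue of negative keys, a momentum update of the key encoder, and a first-in-first-out queue. All function names are hypothetical, the temperature (0.07) and momentum (0.999) are common MoCo defaults rather than values confirmed by this summary, and the sketch does not reproduce the authors' encoder, augmentations, or training setup.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def info_nce_loss(q, k, queue, temperature=0.07):
    """InfoNCE loss for instance discrimination (generic MoCo-style sketch).

    q:     (N, D) query embeddings, e.g. one segment of each utterance
    k:     (N, D) positive key embeddings, e.g. another segment of the same utterance
    queue: (K, D) key embeddings from earlier batches, used as negatives
    temperature: assumed default, not taken from the summarized paper
    """
    q, k, queue = l2_normalize(q), l2_normalize(k), l2_normalize(queue)
    l_pos = np.sum(q * k, axis=1, keepdims=True)   # (N, 1) positive similarities
    l_neg = q @ queue.T                            # (N, K) negative similarities
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the positive key at index 0 of every row.
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder's parameters."""
    return [m * kp + (1.0 - m) * qp for kp, qp in zip(key_params, query_params)]

def enqueue_dequeue(queue, new_keys):
    """Drop the oldest keys and append the newest, keeping the queue size fixed."""
    return np.concatenate([queue[len(new_keys):], l2_normalize(new_keys)], axis=0)
```

In a training loop of this kind, only the query encoder receives gradients from the loss; the key encoder changes solely through `momentum_update`, which keeps the keys stored in the queue reasonably consistent with one another across batches.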
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Contrastive Learning
