A comprehensive study on self-supervised distillation for speaker representation learning

Zhengyang Chen; Yao Qian; Bing Han; Yanmin Qian; Michael Zeng

arXiv:2210.15936 · cs.SD · November 28, 2022 · 1 citation

Open Access

TL;DR

This paper investigates self-supervised distillation methods for speaker representation learning, with an emphasis on data augmentation, and achieves state-of-the-art results on the VoxCeleb1 benchmark without using speaker labels.

Contribution

It introduces a novel audio perturbation augmentation strategy that enhances self-distilled self-supervised speaker representation learning.

Findings

01 Achieves a new state-of-the-art EER on the VoxCeleb1 speaker verification benchmark

02 Demonstrates the effectiveness of data augmentation in self-supervised learning

03 Trains the model without speaker labels

Abstract

In real application scenarios, it is often challenging to obtain a large amount of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning, which requires no labels, has become an increasingly promising way to address this problem. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function and are thus more attractive. In this paper, we present a comprehensive study on self-distilled self-supervised speaker representation learning, with a particular focus on the critical role of data augmentation. Our proposed audio perturbation augmentation strategy pushes the performance of the speaker representation to a new limit. The experimental results show that our model achieves a new SoTA on the VoxCeleb1 speaker verification evaluation benchmark (i.e., equal error rate (EER) 2.505%, 2.473%, and 4.791% for trial Vox1-O, Vox1-E and…
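For readers unfamiliar with the metric quoted in the abstract, the equal error rate is the operating point at which the false acceptance rate (non-target trials accepted) equals the false rejection rate (target trials rejected). The following is a minimal sketch of that definition using a plain threshold sweep. It is not the scoring code used in the paper; the function name `equal_error_rate` and the toy scores are assumptions made purely for illustration, and official scoring tools typically interpolate between thresholds rather than picking the nearest one.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    scores: similarity score per trial (higher means "same speaker").
    labels: True for target (same-speaker) trials, False for non-target.
    Hypothetical helper for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    target = scores[labels]
    nontarget = scores[~labels]

    thresholds = np.unique(scores)
    # False acceptance rate: non-target trials scoring at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy trial list (made-up scores, not results from the paper).
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [True, True, False, False]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On this toy list one target trial scores below one non-target trial, so no threshold separates the two classes perfectly and the error rates meet at one half. A reported EER of a few percent means the two error rates meet at that value on the corresponding trial list.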

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing