Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

TL;DR
This paper investigates why self-supervised learning designed for speech recognition also improves speaker recognition. It finds that the mask speech prediction loss, data scale, and model size are the key factors, and supports this with integrated gradients attribution and loss landscape visualization.
Contribution
It identifies the main factors contributing to SSL's success in speaker recognition and provides a detailed analysis of their relative impacts.
Findings
Mask speech prediction loss enhances speaker recognition.
Larger data and model sizes improve SSL performance.
The SSL quantizer has only a minor effect on speaker recognition.
Abstract
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.
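The integrated gradients attribution method mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy quadratic function below stands in for a real SSL speaker model, and the names `integrated_gradients`, `toy_model`, `toy_model_grad`, and `WEIGHTS` are hypothetical. The sketch assumes a straight-line path from a baseline input to the actual input and approximates the path integral with a midpoint Riemann sum.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 50,
) -> Vector:
    """Approximate integrated gradients along the straight path baseline -> x."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        # Midpoint of the k-th of `steps` equal sub-intervals of [0, 1].
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        # Accumulate the running average of the gradient along the path.
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the input's displacement from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical stand-in for a model: F(x) = sum(w_i * x_i^2).
WEIGHTS = [1.0, 2.0, 3.0]


def toy_model(x: Vector) -> float:
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_model_grad(x: Vector) -> Vector:
    # Analytic gradient of toy_model; a real model would use autodiff instead.
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, -1.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model_grad, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`. For a speech model, the input would be the audio features and the gradient would come from an automatic differentiation framework rather than a hand-written function.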
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
