Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Sanyuan Chen; Yu Wu; Chengyi Wang; Shujie Liu; Zhuo Chen; Peidong Wang; Gang Liu; Jinyu Li; Jian Wu; Xiangzhan Yu; Furu Wei

arXiv:2204.12765 · cs.CL · June 28, 2022

TL;DR

This paper investigates why self-supervised learning improves speaker recognition, revealing that mask speech prediction, data scale, and model size are key factors, with insights gained through attribution and visualization methods.

Contribution

It identifies the main factors contributing to SSL's success in speaker recognition and provides a detailed analysis of their relative impacts.

Findings

01. Mask speech prediction loss enhances speaker recognition.

02. Larger data and model sizes improve SSL performance.

03. SSL quantizer has minimal effect on speaker recognition.

Abstract

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.
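
For readers unfamiliar with the integrated gradients attribution method named in the abstract, the following is a minimal sketch of the general technique, not the authors' implementation. It attributes a model's output to each input feature by averaging gradients along a straight path from a baseline input to the actual input, then scaling by the input difference. The toy quadratic model, its weights `w`, the all-zeros baseline, and the function names are all assumptions chosen for illustration; the paper applies the method to speech models.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn  : function returning dF/dx at a given input (assumed available)
    x        : input to explain
    baseline : reference input (an all-zeros input in this sketch)
    """
    # Midpoints of `steps` equal sub-intervals of [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Gradients at points interpolated between baseline and x.
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * grads.mean(axis=0)


# Hypothetical toy model: F(x) = sum_i w_i * x_i^2, with an analytic gradient.
w = np.array([0.5, -1.0, 2.0])


def model(x):
    return float(np.sum(w * x ** 2))


def model_grad(x):
    return 2.0 * w * x


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(model_grad, x, baseline)

# Completeness property: attributions sum to F(x) - F(baseline).
total_change = model(x) - model(baseline)
```

In this toy case the attributions sum to the change in model output between the baseline and the input, which is the completeness property that makes the method useful for asking which parts of an input drive a prediction.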

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing