Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation

Danwei Cai; Ming Li

arXiv:2309.03019 · eess.AS · July 17, 2024 · IEEE/ACM Trans. Audio Speech Lang. Process.

Open Access

TL;DR

This paper adapts ASR-pretrained Conformers to speaker verification through transfer learning, knowledge distillation, and a lightweight speaker adaptor, reaching EERs of 0.48%, 0.43%, and 0.57%, respectively, on VoxCeleb.

Contribution

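It introduces three strategies for transferring ASR Conformer knowledge to speaker verification: initializing the speaker embedding network from the pretrained Conformer, distilling into a more flexible speaker verification model with a frame-level ASR loss as an auxiliary task, and adding a lightweight speaker adaptor that leaves the original ASR Conformer unchanged.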

Findings

01

Transfer learning reduces EER to 0.48%.

02

Knowledge distillation achieves 0.43% EER.

03

Lightweight adaptor attains 0.57% EER while adding only 4.92M parameters to a 130.94M-parameter model.
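
The findings above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of how such a number is obtained from verification trial scores, here is a minimal sketch. It is not the paper's evaluation code; the function name, the threshold sweep, and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores, not data from the paper.
toy_targets = [0.9, 0.8, 0.7, 0.3]
toy_nontargets = [0.6, 0.2, 0.1, 0.05]
print(equal_error_rate(toy_targets, toy_nontargets))  # 0.25 for these toy scores
```

Production toolkits usually interpolate along the ROC curve rather than picking the nearest observed threshold, so values on real trial lists may differ slightly from this approximation.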

Abstract

This paper explores the use of ASR-pretrained Conformers for speaker verification, leveraging their strengths in modeling speech signals. We introduce three strategies: (1) Transfer learning to initialize the speaker embedding network, improving generalization and reducing overfitting. (2) Knowledge distillation to train a more flexible speaker verification model, incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight speaker adaptor for efficient feature conversion without altering the original ASR Conformer, allowing parallel ASR and speaker verification. Experiments on VoxCeleb show significant improvements: transfer learning yields a 0.48% EER, knowledge distillation results in a 0.43% EER, and the speaker adaptor approach, with just an added 4.92M parameters to a 130.94M-parameter model, achieves a 0.57% EER. Overall, our methods effectively transfer ASR…
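
Strategy (2) pairs a speaker objective with a frame-level ASR loss as an auxiliary task. The abstract does not give the exact form of either loss, so the following is only a sketch under stated assumptions: plain cross-entropy stands in for the speaker loss, a per-frame KL divergence between teacher and student posteriors stands in for the ASR term, and `asr_weight` is a hypothetical mixing coefficient. None of these names or choices come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one example against an integer class label."""
    return -math.log(softmax(logits)[label])


def frame_kl(teacher_logits, student_logits):
    """KL(teacher || student) between the token posteriors of one frame."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multitask_loss(speaker_logits, speaker_label,
                   teacher_frames, student_frames, asr_weight=0.5):
    """Speaker loss plus a weighted, frame-averaged auxiliary ASR term.

    asr_weight is an assumed hyperparameter, not a value from the paper.
    """
    speaker_term = cross_entropy(speaker_logits, speaker_label)
    asr_term = sum(
        frame_kl(t, s) for t, s in zip(teacher_frames, student_frames)
    ) / len(teacher_frames)
    return speaker_term + asr_weight * asr_term
```

When the student's frame posteriors match the teacher's exactly, the auxiliary term vanishes and the total reduces to the speaker loss alone, which is a convenient sanity check for this kind of multi-task objective.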

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Knowledge Distillation