Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation
Danwei Cai, Ming Li

TL;DR
This paper adapts ASR-pretrained Conformers to speaker verification through transfer learning, knowledge distillation, and a lightweight adaptor, reaching equal error rates (EERs) between 0.43% and 0.57% on VoxCeleb.
Contribution
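The paper introduces three strategies for transferring ASR Conformer knowledge to speaker verification: transfer learning to initialize the speaker embedding network, knowledge distillation with a frame-level ASR loss as an auxiliary task, and a lightweight speaker adaptor that leaves the original ASR Conformer unchanged so ASR and speaker verification can run in parallel.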
Findings
Transfer learning yields a 0.48% EER on VoxCeleb.
Knowledge distillation achieves a 0.43% EER.
The lightweight adaptor attains a 0.57% EER while adding only 4.92M parameters to a 130.94M-parameter model.
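Every finding above is reported as an equal error rate (EER): the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how that metric can be computed from trial scores, the snippet below sweeps thresholds over the observed scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions; this is not the authors' evaluation code, and real evaluations usually interpolate the ROC curve rather than pick the nearest threshold.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False rejection rate: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (hypothetical, not taken from VoxCeleb).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {equal_error_rate(targets, nontargets):.2%}")  # prints "EER = 25.00%"
```

With these toy scores, the threshold 0.6 rejects one of four target trials and accepts one of four non-target trials, so both error rates are 25%. The reported figures of 0.43% to 0.57% correspond to far better separation between the two score distributions.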
Abstract
This paper explores the use of ASR-pretrained Conformers for speaker verification, leveraging their strengths in modeling speech signals. We introduce three strategies: (1) Transfer learning to initialize the speaker embedding network, improving generalization and reducing overfitting. (2) Knowledge distillation to train a more flexible speaker verification model, incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight speaker adaptor for efficient feature conversion without altering the original ASR Conformer, allowing parallel ASR and speaker verification. Experiments on VoxCeleb show significant improvements: transfer learning yields a 0.48% EER, knowledge distillation results in a 0.43% EER, and the speaker adaptor approach, with just an added 4.92M parameters to a 130.94M-parameter model, achieves a 0.57% EER. Overall, our methods effectively transfer ASR…
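To put the adaptor's size in perspective, the parameter counts quoted in the abstract can be turned into a relative overhead. The short calculation below uses only those two figures; the variable and function names are illustrative, and it assumes the 130.94M figure refers to the model before the adaptor is added, as the abstract's wording suggests.

```python
# Parameter counts in millions, as quoted in the abstract.
BASE_PARAMS_M = 130.94     # assumed to be the model before the adaptor is added
ADAPTOR_PARAMS_M = 4.92    # parameters added by the speaker adaptor


def relative_overhead(base_params, added_params):
    """Return the added parameters as a fraction of the base model size."""
    return added_params / base_params


overhead = relative_overhead(BASE_PARAMS_M, ADAPTOR_PARAMS_M)
total_params_m = BASE_PARAMS_M + ADAPTOR_PARAMS_M

print(f"Adaptor overhead: {overhead:.2%}")        # prints "Adaptor overhead: 3.76%"
print(f"Combined size: {total_params_m:.2f}M")    # prints "Combined size: 135.86M"
```

Under that assumption, the adaptor adds under 4% to the parameter count while enabling speaker verification alongside the unchanged ASR model.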
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Knowledge Distillation
