Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems
Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

TL;DR
This paper introduces three methods for speaker verification with deep neural networks based on multi-head self-attention: a learnable class token that replaces global average pooling, a Bayesian estimation of that token, and a distillation token for teacher-student training. Together they improve robustness and performance.
Contribution
It proposes a learnable class token and a knowledge distillation framework for multi-head self-attention based speaker verification, enhancing temporal structure modeling and robustness.
Findings
Competitive results on RSR2015-Part II and DeepMine-Part 1 datasets.
Improved robustness with Bayesian estimation and distillation tokens.
The class token outperforms average pooling for embedding extraction.
Abstract
This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the average global pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation…
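The class-token mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head with identity query/key/value projections, a fixed vector in place of a learned token, and no positional information or memory layers, so it does not reproduce the temporal-structure behaviour the paper reports. All function names (`self_attention`, `class_token_embedding`, `average_pooling_embedding`) are hypothetical. It only shows where the embedding comes from in each scheme: the output state at the token's position versus the mean over all frame outputs.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections. A real MSA layer would use
    learned projections and several heads.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def class_token_embedding(frames, class_token):
    """Prepend the class token to the frames and return its output state."""
    seq = [class_token] + frames
    return self_attention(seq)[0]


def average_pooling_embedding(frames):
    """Baseline: average the attention outputs over all frames."""
    out = self_attention(frames)
    d = len(out[0])
    return [sum(row[j] for row in out) / len(out) for j in range(d)]


random.seed(0)
# Six hypothetical 4-dimensional acoustic frames.
frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
# Stand-in for the learnable class token (it would be trained in practice).
class_token = [0.1, -0.2, 0.3, 0.0]

emb_cls = class_token_embedding(frames, class_token)
emb_avg = average_pooling_embedding(frames)
```

In a trained system, `emb_cls` would feed the classifier during training and serve as the speaker embedding at verification time; here both embeddings are simply 4-dimensional vectors computed from random frames.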
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Knowledge Distillation · Average Pooling
