Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems

Victoria Mingote; Antonio Miguel; Alfonso Ortega; Eduardo Lleida

arXiv:2111.03842 · eess.AS · February 13, 2023 · 1 cites

Open Access

TL;DR

This paper introduces novel methods for speaker verification using deep neural networks with multi-head self-attention, including a learnable class token, Bayesian estimation, and knowledge distillation, leading to improved robustness and performance.

Contribution

It proposes a learnable class token and a knowledge distillation framework for multi-head self-attention based speaker verification, enhancing temporal structure modeling and robustness.

Findings

01  Competitive results on RSR2015-Part II and DeepMine-Part 1 datasets.

02  Improved robustness with Bayesian estimation and distillation tokens.

03  Outperforms average pooling in embedding extraction.

Abstract

This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the global average pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation…
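The class-token mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: it assumes a single MSA layer with random, untrained weights and omits residual connections, layer normalization, positional information, and the memory layers. All function and variable names (`multi_head_self_attention`, `class_token_embedding`, `average_pooling_embedding`, `params`) are hypothetical. The sketch only shows where the two pooling strategies differ: one prepends a token and reads out its output state, the other averages the output frames over time.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """Plain multi-head self-attention. x: (T, D) -> (T, D)."""
    t, d = x.shape
    dh = d // num_heads
    # Project and split into heads: (H, T, dh).
    q = (x @ wq).reshape(t, num_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(t, num_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(t, num_heads, dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (H, T, T)
    heads = softmax(scores, axis=-1) @ v              # (H, T, dh)
    merged = heads.transpose(1, 0, 2).reshape(t, d)   # (T, D)
    return merged @ wo

def class_token_embedding(frames, class_token, params, num_heads=4):
    # Prepend the (learnable) class token to the frame sequence,
    # run the MSA layer, and read the output state at position 0.
    x = np.concatenate([class_token[None, :], frames], axis=0)  # (T+1, D)
    y = multi_head_self_attention(x, *params, num_heads=num_heads)
    return y[0]

def average_pooling_embedding(frames, params, num_heads=4):
    # Baseline: run the MSA layer, then average the outputs over time.
    y = multi_head_self_attention(frames, *params, num_heads=num_heads)
    return y.mean(axis=0)

# Toy data; sizes are assumptions chosen for illustration only.
rng = np.random.default_rng(0)
T, D, H = 50, 16, 4
frames = rng.standard_normal((T, D))
class_token = rng.standard_normal(D) * 0.02   # would be trained in practice
params = tuple(rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

emb_cls = class_token_embedding(frames, class_token, params, H)
emb_avg = average_pooling_embedding(frames, params, H)
print(emb_cls.shape, emb_avg.shape)
```

In a trained system the class token would be a parameter updated by backpropagation, and the embedding read from its output position would feed the classifier, as the abstract describes. The Bayesian estimation and the distillation token are not sketched here because the abstract does not give enough detail to do so faithfully.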

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: Knowledge Distillation · Average Pooling