TL;DR
This paper introduces Double Multi-Head Attention pooling for speaker verification. It adds a second self-attention layer that weights the per-head context vectors produced by Multi-Head Attention pooling, yielding more discriminative speaker embeddings and a lower EER on VoxCeleb2 than Self Attention pooling and Self Multi-Head Attention pooling.
Contribution
The paper proposes Double Multi-Head Attention pooling, which extends the authors' earlier Self Multi-Head Attention pooling with an additional self-attention layer that summarizes the heads' context vectors into a single speaker representation.
Findings
Achieved a 6.09% relative EER reduction over Self Attention pooling on VoxCeleb2.
Achieved a 5.23% relative EER reduction over Self Multi-Head Attention pooling on VoxCeleb2.
Demonstrated improved feature selection for CNN-based front-ends.
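For readers unfamiliar with the metric, a relative EER reduction is the drop in equal error rate expressed as a percentage of the baseline's EER. The sketch below shows that arithmetic only. The function name and the EER values in the example are hypothetical and are not taken from the paper.

```python
def relative_eer_reduction(baseline_eer: float, new_eer: float) -> float:
    """Return the relative EER reduction in percent.

    Both arguments are equal error rates in the same unit (e.g. percent).
    A positive result means the new system has a lower EER than the baseline.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical values for illustration only (not results from the paper).
example = relative_eer_reduction(baseline_eer=4.00, new_eer=3.80)
print(f"relative EER reduction: {example:.2f}%")
```

With these made-up inputs, an absolute drop of 0.20 points on a 4.00% baseline corresponds to a relative reduction of roughly 5%.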
Abstract
Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self attention layer is added to the pooling layer that summarizes the context vectors produced by Multi-Head Attention into a unique speaker representation. This method enhances the pooling mechanism by giving weights to the information captured for each head and it results in creating more discriminative speaker embeddings. We have evaluated our approach with the VoxCeleb2 dataset. Our results show 6.09% and 5.23% relative improvement in terms of EER…
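The pooling described in the abstract can be sketched in NumPy as follows. This is one plausible reading, not the authors' implementation: the frame features are split into heads, each head attends over time to produce a context vector, and a second attention layer weights those context vectors into one embedding. The function and parameter names (`double_mha_pooling`, `head_queries`, `summary_query`), the scaling by the square root of the head dimension, and the use of random values in place of trained parameters are all assumptions.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def double_mha_pooling(h, head_queries, summary_query):
    """Pool a (T, D) sequence of frame features into one (D // K,) vector.

    h:             (T, D) front-end outputs for T frames.
    head_queries:  (K, D // K) one attention vector per head (assumed shape).
    summary_query: (D // K,) attention vector over heads (assumed shape).
    """
    num_frames, feat_dim = h.shape
    num_heads, head_dim = head_queries.shape
    if num_heads * head_dim != feat_dim:
        raise ValueError("D must equal K * head_dim")

    # Split every frame into K sub-vectors, one per head: (T, K, d).
    heads = h.reshape(num_frames, num_heads, head_dim)

    # First attention: each head weights the frames over time.
    scores = np.einsum("tkd,kd->tk", heads, head_queries) / np.sqrt(head_dim)
    frame_weights = softmax(scores, axis=0)                      # (T, K)
    context = np.einsum("tk,tkd->kd", frame_weights, heads)      # (K, d)

    # Second attention: weight the K context vectors against each other.
    head_scores = context @ summary_query / np.sqrt(head_dim)    # (K,)
    head_weights = softmax(head_scores, axis=0)                  # (K,)
    embedding = head_weights @ context                           # (d,)
    return embedding, frame_weights, head_weights


# Illustrative call with random data standing in for trained parameters.
rng = np.random.default_rng(0)
T, D, K = 50, 64, 8
features = rng.standard_normal((T, D))
emb, frame_w, head_w = double_mha_pooling(
    features,
    head_queries=rng.standard_normal((K, D // K)),
    summary_query=rng.standard_normal(D // K),
)
print(emb.shape, frame_w.shape, head_w.shape)
```

Under this reading, the pooled vector has the dimension of a single head, because the second layer takes a weighted sum of the context vectors rather than concatenating them as plain multi-head pooling would.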
