Self-attention encoding and pooling for speaker recognition
Pooyan Safari, Miquel India, Javier Hernando

TL;DR
This paper introduces a self-attention-based encoding and pooling mechanism for speaker recognition that achieves high performance with significantly fewer parameters, making it suitable for resource-constrained devices.
Contribution
The paper proposes a novel Self-Attention Encoding and Pooling (SAEP) method that outperforms traditional models like x-vector with fewer parameters, enhancing efficiency in speaker verification.
Findings
Outperforms baseline x-vector in speaker verification tasks
Reduces model size by up to 95% compared to ResNet models
Achieves competitive results on VoxCeleb datasets
Abstract
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. Meanwhile, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks that rely solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to…
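The encode-then-pool idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a single attention head, random untrained weights, and a hypothetical learned query vector u for pooling, and it leaves out layer normalization, positional encodings, and dropout. All function and variable names are made up for illustration. The point it shows is that utterances with different numbers of frames map to embeddings of the same size.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def add(a, b):
    """Element-wise sum of two matrices (used for residual connections)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over T frames."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # T x T frame-to-frame similarities
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def feed_forward(x, w1, w2):
    """Position-wise feed-forward: the same two-layer ReLU net on every frame."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def encoder_block(x, p):
    """One block: self-attention, then feed-forward, each with a residual."""
    x = add(x, self_attention(x, p["wq"], p["wk"], p["wv"]))
    return add(x, feed_forward(x, p["w1"], p["w2"]))

def attention_pool(x, u):
    """Collapse T frames into one vector, weighting frames by softmax(x . u)."""
    weights = softmax([sum(f * c for f, c in zip(frame, u)) for frame in x])
    return [sum(w * frame[d] for w, frame in zip(weights, x)) for d in range(len(x[0]))]

def embed(frames, blocks, u):
    """Encode frames with a stack of blocks, then pool to a fixed-size vector."""
    x = frames
    for p in blocks:
        x = encoder_block(x, p)
    return attention_pool(x, u)

rng = random.Random(0)
D, H = 8, 16  # feature and hidden sizes, chosen arbitrarily for the sketch
blocks = [
    {"wq": rand_matrix(D, D, rng), "wk": rand_matrix(D, D, rng),
     "wv": rand_matrix(D, D, rng), "w1": rand_matrix(D, H, rng),
     "w2": rand_matrix(H, D, rng)}
    for _ in range(2)
]
u = [rng.uniform(-0.1, 0.1) for _ in range(D)]

short_emb = embed(rand_matrix(5, D, rng), blocks, u)   # 5-frame utterance
long_emb = embed(rand_matrix(12, D, rng), blocks, u)   # 12-frame utterance
# Both embeddings have length D regardless of the number of input frames.
```

Because the pooling weights are a softmax over frames, the pooled vector is a weighted average of the encoded frames, which is what lets the output size stay independent of utterance length.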
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Label Smoothing · Dropout · Adam · Multi-Head Attention · Softmax
