Self-attention encoding and pooling for speaker recognition

Pooyan Safari; Miquel India; Javier Hernando

arXiv:2008.01077·eess.AS·August 5, 2020

TL;DR

This paper introduces a self-attention-based encoding and pooling mechanism for speaker recognition that outperforms the x-vector baseline with significantly fewer parameters, making it suitable for resource-constrained devices.

Contribution

The paper proposes a novel Self-Attention Encoding and Pooling (SAEP) method that outperforms traditional models like x-vector with fewer parameters, enhancing efficiency in speaker verification.

Findings

01 Outperforms the baseline x-vector in speaker verification tasks

02 Reduces model size by up to 95% compared to ResNet models

03 Achieves competitive results on the VoxCeleb datasets

Abstract

The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to…
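The idea of turning a variable-length utterance into a fixed-size speaker embedding can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single attention head, random untrained weights, and omits the position-wise feed-forward networks, layer normalization, and block stacking. The pooling step (a softmax over dot products with a query vector) is an assumed form of attention pooling, and all names (`self_attention`, `attention_pool`, `embed`, `pool_query`) are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over T frames."""
    q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)
    scale = math.sqrt(len(k[0]))
    out = []
    for q_t in q:
        # Each frame attends to every frame in the utterance.
        weights = softmax([sum(a * b for a, b in zip(q_t, k_s)) / scale for k_s in k])
        out.append([sum(w * v_s[j] for w, v_s in zip(weights, v)) for j in range(len(v[0]))])
    return out

def attention_pool(encoded, query):
    """Collapse T encoded frames into one vector (assumed pooling form)."""
    weights = softmax([sum(h * u for h, u in zip(frame, query)) for frame in encoded])
    dim = len(encoded[0])
    return [sum(w * frame[j] for w, frame in zip(weights, encoded)) for j in range(dim)]

def random_matrix(rows, cols, rng):
    """Random stand-in for trained parameters or input features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def embed(frames, params):
    """Encode the frames, then pool them into a single embedding."""
    encoded = self_attention(frames, params["w_q"], params["w_k"], params["w_v"])
    return attention_pool(encoded, params["pool_query"])

rng = random.Random(0)
feat_dim, model_dim = 4, 3  # toy sizes chosen for illustration only
params = {
    "w_q": random_matrix(feat_dim, model_dim, rng),
    "w_k": random_matrix(feat_dim, model_dim, rng),
    "w_v": random_matrix(feat_dim, model_dim, rng),
    "pool_query": [rng.uniform(-0.5, 0.5) for _ in range(model_dim)],
}

short_utt = random_matrix(5, feat_dim, rng)   # 5 frames of spectral features
long_utt = random_matrix(12, feat_dim, rng)   # 12 frames of spectral features
emb_short = embed(short_utt, params)
emb_long = embed(long_utt, params)
print(len(emb_short), len(emb_long))  # both embeddings have model_dim entries
```

Because the pooling weights sum to one across frames, the output dimension depends only on `model_dim` and not on the number of frames, which is what allows utterances of different lengths to be compared in the same embedding space.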

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Label Smoothing · Dropout · Adam · Multi-Head Attention · Softmax