Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization for End-to-End Speaker Verification System

Soonshin Seo; Ji-Hwan Kim

arXiv:2007.13350·eess.AS·July 29, 2020

Open Access

TL;DR

This paper introduces a self-attentive multi-layer aggregation method with feature recalibration and normalization to improve end-to-end speaker verification, reducing parameters and controlling variability for better performance.

Contribution

The paper proposes a self-attentive multi-layer aggregation method with feature recalibration and normalization, aiming to improve speaker embedding quality while reducing the number of model parameters.

Findings

1. Achieved performance comparable to state-of-the-art models on the VoxCeleb datasets.

2. Reduced the number of model parameters by using a ResNet baseline with scaled channel width and layer depth.

3. Improved robustness by applying self-attention with dropout regularization and batch normalization.

Abstract

One of the most important parts of an end-to-end speaker verification system is speaker embedding generation. In our previous paper, we reported that shortcut connection-based multi-layer aggregation improves the representational power of the speaker embedding. However, the number of model parameters is relatively large, and unspecified variations increase in the multi-layer aggregation. Therefore, we propose a self-attentive multi-layer aggregation with feature recalibration and normalization for an end-to-end speaker verification system. To reduce the number of model parameters, a ResNet with scaled channel width and layer depth is used as the baseline. To control variability during training, a self-attention mechanism is applied to perform the multi-layer aggregation, together with dropout regularization and batch normalization. Then, a feature recalibration layer is applied to…
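The aggregation step described in the abstract can be illustrated with a minimal sketch of generic self-attentive pooling over layer-wise features. This is not the authors' implementation: the scoring function (a dot product between a learned vector and a tanh-squashed feature), the names `self_attentive_aggregate`, `layer_feats`, and `w`, and the toy dimensions are all assumptions made for illustration. The feature recalibration layer, dropout, and batch normalization are omitted because the abstract does not specify them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_aggregate(layer_feats, w):
    """Combine per-layer feature vectors with attention weights.

    layer_feats: list of L vectors, each of length D (assumed to be
                 pooled outputs of different network layers).
    w:           hypothetical learned scoring vector of length D.
    Returns the aggregated vector (length D) and the L attention weights.
    """
    # One scalar score per layer: w . tanh(h_l)
    scores = [
        sum(wi * math.tanh(hi) for wi, hi in zip(w, h))
        for h in layer_feats
    ]
    alphas = softmax(scores)
    dim = len(layer_feats[0])
    # Attention-weighted sum across layers, one output value per dimension
    pooled = [
        sum(a * h[d] for a, h in zip(alphas, layer_feats))
        for d in range(dim)
    ]
    return pooled, alphas


# Toy usage: 3 layers, 4-dimensional features, random stand-in values
random.seed(0)
layer_feats = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
w = [random.gauss(0, 1) for _ in range(4)]
embedding, weights = self_attentive_aggregate(layer_feats, w)
```

In this sketch the attention weights sum to one, so layers that score higher contribute more to the aggregated vector; with an all-zero `w` the result reduces to a plain average across layers.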

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Convolution · Residual Block · Average Pooling · 1x1 Convolution · Residual Connection · Global Average Pooling · Kaiming Initialization · Batch Normalization · Max Pooling