Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances
Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

TL;DR
This paper introduces a novel attention-based back-end model for speaker verification that effectively utilizes multiple enrollment utterances, improving accuracy over traditional methods like PLDA and cosine similarity across various datasets.
Contribution
The paper proposes a new attention back-end model employing scaled-dot and feed-forward self-attention networks for better intra-relationship learning among enrollment utterances in speaker verification.
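As a rough illustration of how such a back-end could combine several enrollment embeddings, here is a minimal sketch in plain Python. It is not the authors' implementation. The sketch assumes the queries, keys, and values are the raw embeddings (no learned projections), that the pooling vector w is a hypothetical learned parameter, and that the final score is a cosine similarity. All function names are illustrative.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_self_attention(embs):
    # Each enrollment embedding attends to every enrollment embedding.
    # Assumption: Q = K = V = embs, with no learned projection matrices.
    d = len(embs[0])
    out = []
    for q in embs:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in embs])
        out.append([sum(w * v[i] for w, v in zip(weights, embs)) for i in range(d)])
    return out

def feed_forward_attention_pool(embs, w):
    # Score each embedding with a hypothetical learned vector w,
    # normalize the scores with softmax, and return the weighted sum.
    weights = softmax([math.tanh(dot(w, e)) for e in embs])
    d = len(embs[0])
    return [sum(a * e[i] for a, e in zip(weights, embs)) for i in range(d)]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def verify_score(enroll, test, w):
    # Refine the enrollment set, pool it into one speaker vector,
    # then compare that vector with the test embedding.
    refined = scaled_dot_self_attention(enroll)
    speaker = feed_forward_attention_pool(refined, w)
    return cosine(speaker, test)
```

With a single enrollment utterance both attention steps reduce to the identity, so the score falls back to a plain cosine similarity between the enrollment and test embeddings.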
Findings
Lower EER and minDCF scores on CNCeleb with multiple enrollments
Effective for both text-independent and text-dependent verification
Applicable even with a single enrollment utterance
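For readers unfamiliar with the EER metric named above, the following is a minimal sketch of how an equal error rate can be estimated from trial scores. It is a simple threshold sweep written for illustration, not the scoring tool used in the paper. The function name and the convention that a trial is accepted when its score is at or above the threshold are assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    # Sweep every observed score as a candidate threshold and keep the
    # point where false acceptance and false rejection rates are closest.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Perfectly separated target and non-target scores give an EER of 0 under this sketch. Production toolkits usually interpolate between thresholds rather than picking the nearest observed one.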
Abstract
Probabilistic linear discriminant analysis (PLDA) and cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we propose a novel attention back-end model, which can be used for both text-independent (TI) and text-dependent (TD) speaker verification, and employ scaled-dot self-attention and feed-forward self-attention networks as architectures that learn the intra-relationships of the enrollment utterances. To verify the proposed attention back-end, we conduct a series of experiments on the CNCeleb and VoxCeleb datasets by combining it with several state-of-the-art speaker encoders, including TDNN and ResNet. Experimental results using multiple enrollment utterances on CNCeleb show that the proposed attention back-end model leads to lower EER and…
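To make the pairwise cosine baseline concrete, here is a small sketch of one common way to handle several enrollment utterances with it: score each enrollment embedding against the test embedding and average the scores. Whether the paper's baseline averages scores or embeddings is not stated in this summary, so the averaging choice and the function names are assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pairwise_score(enroll, test):
    # Score every enrollment utterance against the test utterance
    # independently, then average the pairwise scores.
    return sum(cosine(e, test) for e in enroll) / len(enroll)
```

Because each pair is scored in isolation, this baseline cannot model relationships among the enrollment utterances, which is the gap the attention back-end is meant to address.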
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Residual Connection · 1x1 Convolution · Average Pooling · Residual Block · Batch Normalization · Bottleneck Residual Block · Max Pooling · Convolution · Kaiming Initialization
