Exploring a Unified Attention-Based Pooling Framework for Speaker Verification
Yi Liu, Liang He, Weiwei Liu, Jia Liu

TL;DR
This paper introduces a unified attention-based pooling framework for speaker verification and combines it with multi-head attention. Computing the attention weights from lower-layer outputs outperforms both average pooling and vanilla attention on the Fisher and NIST SRE 2010 datasets.
Contribution
The paper proposes a novel unified attention-based pooling method combined with multi-head attention, enhancing speaker verification performance over existing average pooling techniques.
Findings
Attention-based pooling outperforms average pooling.
Computing the attention weights from lower-layer outputs outperforms both average pooling and vanilla attention.
Multi-head attention further boosts verification accuracy.
Abstract
The pooling layer is an essential component in neural-network-based speaker verification. Most current speaker verification networks use average pooling to derive utterance-level speaker representations. Average pooling treats every frame as equally important, which is suboptimal because speaker-discriminative power differs across speech segments. In this paper, we present a unified attention-based pooling framework and combine it with multi-head attention. Experiments on the Fisher and NIST SRE 2010 datasets show that using outputs from lower layers to compute the attention weights outperforms average pooling and achieves better results than the vanilla attention method. Multi-head attention further improves performance.
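The contrast between average pooling and attention-based pooling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tanh scoring layer, the parameter shapes, and the choice to let every head attend over the full value vectors are all assumptions made for illustration. The `keys` argument stands in for frame-level outputs from a lower layer, and `values` for the frame-level outputs that are pooled.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def average_pooling(values):
    """Baseline: every frame gets the same weight 1/T."""
    return values.mean(axis=0)


def attention_pooling(values, keys, W, b, queries):
    """Illustrative multi-head attention pooling (assumed form, not the paper's).

    values:  (T, Dv) frame-level features to be pooled
    keys:    (T, Dk) frame-level features used to score frames,
             e.g. outputs of a lower layer (assumption)
    W, b:    (Dk, Da), (Da,) hypothetical scoring-layer parameters
    queries: (Da, H) one hypothetical query vector per head
    Returns the concatenated per-head embedding (H * Dv,) and the
    attention weights (T, H).
    """
    hidden = np.tanh(keys @ W + b)       # (T, Da)
    scores = hidden @ queries            # (T, H), one score per frame and head
    weights = softmax(scores, axis=0)    # normalize over frames for each head
    pooled = weights.T @ values          # (H, Dv) weighted sums of the values
    return pooled.reshape(-1), weights


# Usage with random data standing in for real frame-level features.
rng = np.random.default_rng(0)
T, Dv, Dk, Da, H = 50, 8, 6, 4, 2
values = rng.normal(size=(T, Dv))
keys = rng.normal(size=(T, Dk))
W = rng.normal(size=(Dk, Da))
b = np.zeros(Da)
queries = rng.normal(size=(Da, H))

embedding, weights = attention_pooling(values, keys, W, b, queries)
print(embedding.shape, weights.shape)
```

In this sketch, setting all queries to zero gives every frame the same score, so each head reduces to average pooling. That makes average pooling a special case of the weighted sum, with the learned scores deciding how far the weights depart from uniform.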
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
