Exploring a Unified Attention-Based Pooling Framework for Speaker Verification

Yi Liu; Liang He; Weiwei Liu; Jia Liu

arXiv:1808.07120·cs.SD·August 23, 2018

TL;DR

This paper introduces a unified attention-based pooling framework for speaker verification and combines it with multi-head attention. Computing the attention weights from lower-layer outputs outperforms both average pooling and vanilla attention, and multi-head attention improves results further.

Contribution

The paper proposes a unified attention-based pooling framework, combines it with multi-head attention, and reports improved speaker verification performance over average pooling on the Fisher and NIST SRE 2010 datasets.

Findings

01. Attention-based pooling outperforms average pooling.

02. Computing the attention weights from lower-layer outputs gives better results than vanilla attention.

03. Multi-head attention further improves verification performance.

Abstract

The pooling layer is an essential component of neural-network-based speaker verification. Most current speaker verification networks use average pooling to derive utterance-level speaker representations. Average pooling treats every frame as equally important, which is suboptimal because speaker-discriminative power differs between speech segments. In this paper, we present a unified attention-based pooling framework and combine it with multi-head attention. Experiments on the Fisher and NIST SRE 2010 datasets show that using outputs from lower layers to compute the attention weights outperforms average pooling and achieves better results than the vanilla attention method. Multi-head attention further improves performance.
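
To make the contrast between the pooling strategies concrete, here is a minimal pure-Python sketch. It is an illustration only, not the paper's implementation: the dot-product scoring against a query vector, the function names, and the way heads split the feature dimension are all assumptions, since the abstract does not specify the scoring function or the head layout. The `keys` stand in for frame-level outputs of a lower layer and the `values` for the frame-level outputs that are actually pooled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(values):
    """Average pooling: every frame gets the same weight 1/T."""
    n_frames = len(values)
    return [sum(col) / n_frames for col in zip(*values)]


def attention_pool(keys, values, query):
    """Weighted pooling of `values`, with weights computed from `keys`.

    Assumption: the score of frame t is the dot product query . keys[t].
    `keys` may come from a different (e.g. lower) layer than `values`.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


def multi_head_attention_pool(keys, values, queries):
    """Assumed multi-head variant: split the key and value dimensions into
    len(queries) equal slices, pool each slice with its own query, and
    concatenate the per-head results."""
    n_heads = len(queries)
    k_dim = len(keys[0]) // n_heads
    v_dim = len(values[0]) // n_heads
    out = []
    for h, query in enumerate(queries):
        k_h = [k[h * k_dim:(h + 1) * k_dim] for k in keys]
        v_h = [v[h * v_dim:(h + 1) * v_dim] for v in values]
        pooled, _ = attention_pool(k_h, v_h, query)
        out.extend(pooled)
    return out


# Three frames with 2-dimensional features (made-up numbers).
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
keys = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]

mean_vec = average_pool(values)
att_vec, att_weights = attention_pool(keys, values, query=[1.0, 1.0])
mh_vec = multi_head_attention_pool(keys, values, queries=[[1.0], [-1.0]])
```

With an all-zero query every frame receives the same score, so `attention_pool` reduces to `average_pool`. In that sense average pooling is the special case in which the attention weights are uniform.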

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing