Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

Hongning Zhu; Kong Aik Lee; Haizhou Li

arXiv:2107.06493 · cs.SD · July 15, 2021 · 1 citation

TL;DR

This paper introduces a serialized multi-layer multi-head attention mechanism for neural speaker embedding, leveraging hierarchical self-attention to improve speaker verification accuracy.

Contribution

It proposes a novel serialized multi-layer multi-head attention architecture that propagates attentive features across layers for more discriminative speaker embeddings.

Findings

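01

Outperforms the baseline methods on the VoxCeleb1 and SITW datasets.

02

Achieves a 9.7% relative improvement in equal error rate (EER) over the baseline methods.

03

Achieves an 8.1% relative improvement in DCF0.01 (detection cost function) over the baseline methods.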
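
For readers unfamiliar with the term, a relative improvement is the reduction in an error metric divided by the baseline value of that metric. The sketch below shows the arithmetic only; the function name and the EER values in the usage line are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Fractional reduction of an error metric (lower is better) relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Hypothetical numbers, for illustration only: a baseline EER of 2.000%
# and a proposed-system EER of 1.806% give a relative improvement near 0.097.
example = relative_improvement(2.000, 1.806)
```

Multiplying the result by 100 gives the percentage form used in the findings above.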

Abstract

This paper proposes a serialized multi-layer multi-head attention mechanism for neural speaker embedding in text-independent speaker verification. In prior work, frame-level features from one layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules to create fixed-dimensional representations of speakers. Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance using statistics pooling. With…
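
The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions made for illustration: each layer acts as a single attention head, the query is a linear projection of the unweighted mean and standard deviation of the frames, the layer outputs are summed into the final embedding, and frames are passed to the next layer through a residual feed-forward block without layer normalization. All names (`stats_pool`, `serialized_attention`, `init_layers`, `Wq`, `Wk`, `W1`, `W2`) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def stats_pool(h, w=None, eps=1e-8):
    """Weighted mean and standard deviation over frames.

    h has shape (T, d); w has shape (T,) and sums to 1 (uniform if omitted).
    """
    if w is None:
        w = np.full(h.shape[0], 1.0 / h.shape[0])
    mean = w @ h
    var = w @ (h - mean) ** 2
    return mean, np.sqrt(var + eps)


def init_layers(n_layers, d, dk, dff, seed=0):
    """Random (untrained) parameters for each layer; for illustration only."""
    rng = np.random.default_rng(seed)
    return [
        {
            "Wq": rng.normal(0, 0.1, (2 * d, dk)),  # query projection of pooled stats
            "Wk": rng.normal(0, 0.1, (d, dk)),      # key projection of each frame
            "W1": rng.normal(0, 0.1, (d, dff)),     # feed-forward, first layer
            "W2": rng.normal(0, 0.1, (dff, d)),     # feed-forward, second layer
        }
        for _ in range(n_layers)
    ]


def serialized_attention(h, layers):
    """Return a fixed-size (2 * d,) embedding from frame features h of shape (T, d)."""
    total = np.zeros(2 * h.shape[1])
    for p in layers:
        # Input-aware query: built from statistics pooling over this utterance.
        mean, std = stats_pool(h)
        q = np.concatenate([mean, std]) @ p["Wq"]          # (dk,)
        k = h @ p["Wk"]                                    # (T, dk)
        w = softmax(k @ q / np.sqrt(q.size))               # (T,) attention over frames
        # Attentive statistics: weighted mean and std form this layer's output.
        a_mean, a_std = stats_pool(h, w)
        total += np.concatenate([a_mean, a_std])
        # Propagate refined frame features to the next layer (residual feed-forward).
        h = h + np.maximum(h @ p["W1"], 0.0) @ p["W2"]
    return total


# Utterances of different lengths map to embeddings of the same size.
layers = init_layers(n_layers=3, d=16, dk=8, dff=32)
short_utt = np.random.default_rng(1).normal(size=(50, 16))
long_utt = np.random.default_rng(2).normal(size=(300, 16))
emb_short = serialized_attention(short_utt, layers)
emb_long = serialized_attention(long_utt, layers)
```

The point of the sketch is the control flow: attention is applied layer after layer rather than across parallel heads, and each layer contributes its own attentive statistics to the utterance-level representation, whose size does not depend on the number of frames.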

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Residual Connection · Dense Connections · Softmax · Layer Normalization