Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding
Hongning Zhu, Kong Aik Lee, Haizhou Li

TL;DR
This paper introduces a serialized multi-layer multi-head attention mechanism for neural speaker embedding, leveraging hierarchical self-attention to improve speaker verification accuracy.
Contribution
It proposes a novel serialized multi-layer multi-head attention architecture that propagates attentive features across layers for more discriminative speaker embeddings.
Findings
Outperforms baseline methods on the VoxCeleb1 and SITW datasets.
Achieves a 9.7% relative improvement in equal error rate (EER).
Achieves an 8.1% relative improvement in detection cost function (DCF0.01).
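The percentages above are relative, not absolute, improvements. As a minimal sketch of that arithmetic, assuming the usual definition for a lower-is-better metric, the relative improvement is the reduction divided by the baseline value. The function name and the numbers below are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Fractional reduction of an error metric where lower is better."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Hypothetical values for illustration only, not results from the paper.
baseline_eer = 2.0   # percent
proposed_eer = 1.8   # percent
print(f"relative improvement: {relative_improvement(baseline_eer, proposed_eer):.1%}")
```

Under this definition, a drop of 0.2 percentage points from a 2.0% baseline is an absolute improvement of 0.2 points but a relative improvement of about one tenth.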
Abstract
This paper proposes a serialized multi-layer multi-head attention mechanism for neural speaker embedding in text-independent speaker verification. In prior work, frame-level features from one layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules to create fixed-dimensional representations of speakers. Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance with the statistics pooling. With…
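The idea in the abstract can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the layer layout, the scaled dot-product scoring, the residual feed-forward step, the summing of per-layer statistics, and every name (`stats_pool`, `serialized_attention`, `make_layers`, `Wq`, `Wk`, `W1`, `W2`) are assumptions made for illustration. Each layer builds a query from the utterance's own mean and standard deviation, scores every frame against it, pools attention-weighted statistics, and passes refined frames to the next layer.

```python
import numpy as np


def stats_pool(h, w=None):
    """Weighted mean and std over time. h: (T, D); w: (T,) summing to 1."""
    if w is None:
        w = np.full(h.shape[0], 1.0 / h.shape[0])
    mean = w @ h
    var = w @ (h - mean) ** 2
    return mean, np.sqrt(np.maximum(var, 1e-8))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def serialized_attention(h, layers):
    """Map frame features h (T, D) to a fixed-size (2D,) utterance vector."""
    embedding = np.zeros(2 * h.shape[1])
    for p in layers:
        # Input-aware query: derived from this utterance's own statistics.
        mean, std = stats_pool(h)
        q = np.concatenate([mean, std]) @ p["Wq"]        # (K,)
        k = h @ p["Wk"]                                  # (T, K)
        w = softmax(k @ q / np.sqrt(q.shape[0]))         # (T,) frame weights
        # Attentive statistics from this layer are added to the running sum.
        a_mean, a_std = stats_pool(h, w)
        embedding += np.concatenate([a_mean, a_std])
        # Assumed residual position-wise feed-forward step refines the frames.
        h = h + np.maximum(h @ p["W1"], 0.0) @ p["W2"]
    return embedding


def make_layers(rng, n_layers, d, k, f):
    """Hypothetical random weights, standing in for trained parameters."""
    return [
        {
            "Wq": 0.1 * rng.standard_normal((2 * d, k)),
            "Wk": 0.1 * rng.standard_normal((d, k)),
            "W1": 0.1 * rng.standard_normal((d, f)),
            "W2": 0.1 * rng.standard_normal((f, d)),
        }
        for _ in range(n_layers)
    ]


rng = np.random.default_rng(0)
layers = make_layers(rng, n_layers=3, d=8, k=4, f=16)
short_utt = rng.standard_normal((30, 8))
long_utt = rng.standard_normal((50, 8))
print(serialized_attention(short_utt, layers).shape)
print(serialized_attention(long_utt, layers).shape)
```

In this sketch the output size depends only on the feature dimension, so utterances of different lengths map to vectors of the same size, which is the fixed-dimensional property the abstract describes.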
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Residual Connection · Dense Connections · Softmax · Layer Normalization
