End-to-End Trainable Self-Attentive Shallow Network for Text-Independent Speaker Verification

Hyeonmook Park; Jungbae Park; Sang Wan Lee

arXiv:2008.06146 · eess.AS · August 17, 2020 · 1 citation

Open Access

TL;DR

This paper introduces a self-attentive shallow network for speaker verification that overcomes LSTM limitations, achieving significantly better accuracy and efficiency than the GE2E model, especially with longer input sequences.

Contribution

The paper proposes a novel end-to-end trainable self-attentive shallow network combining TDNN and self-attentive pooling for improved speaker verification.

Findings

01

The proposed model outperforms GE2E in accuracy and efficiency.

02

Significant reduction in model size with comparable or better performance.

03

Enhanced performance with longer input sequences, especially in DCF scores.

Abstract

The generalized end-to-end (GE2E) model is widely used in speaker verification (SV) because of its expandability and its generality across languages. However, the long short-term memory (LSTM)-based GE2E has two limitations. First, the GE2E embedding suffers from vanishing gradients, which degrades performance on very long input sequences. Second, utterances are not properly represented as fixed-dimensional vectors. To overcome these issues, we propose a novel SV framework, the end-to-end trainable self-attentive shallow network (SASN), which incorporates a time-delay neural network (TDNN) and a self-attentive pooling mechanism, based on the self-attentive x-vector system, in the utterance embedding phase. We demonstrate that the proposed model is highly efficient and provides more accurate speaker verification than GE2E. For VCTK…
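The abstract names self-attentive pooling but does not give its equations. As a rough illustration only, the sketch below assumes one common additive-attention form: each frame is scored through a tanh projection, the scores are normalized with a softmax over time, and the frames are averaged with those weights. The function name and the parameters W, b, and v are hypothetical and are not taken from the paper; the point is only that utterances of any length map to a vector of the same dimension.

```python
import math
import random


def self_attentive_pooling(frames, W, b, v):
    """Pool T frame vectors (each of length D) into a single length-D vector.

    Assumed form (not confirmed by the source):
        score_t  = v . tanh(W^T h_t + b)
        weight_t = softmax over t of score_t
        pooled   = sum_t weight_t * h_t
    W is D x A (list of D rows), b and v have length A. All names are hypothetical.
    """
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(h[d] * W[d][a] for d in range(len(h))) + b[a])
            for a in range(len(b))
        ]
        scores.append(sum(hidden[a] * v[a] for a in range(len(v))))

    # Softmax over the time axis, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted mean of the frames: output size does not depend on T.
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    rng = random.Random(0)
    D, A = 8, 4  # illustrative sizes, not the paper's configuration
    W = [[rng.gauss(0, 1) for _ in range(A)] for _ in range(D)]
    b = [0.0] * A
    v = [rng.gauss(0, 1) for _ in range(A)]

    for T in (50, 500):
        frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
        pooled, weights = self_attentive_pooling(frames, W, b, v)
        print(T, len(pooled))
```

In a trained system the frames would come from the TDNN layers and W, b, and v would be learned jointly with the rest of the network; here they are random placeholders.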

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing