Attention and DCT based Global Context Modeling for Text-independent Speaker Recognition

Wei Xia; John H.L. Hansen

arXiv:2208.02778 · eess.AS · August 25, 2023

TL;DR

This paper introduces a novel global context modeling approach combining attention mechanisms and DCT techniques to enhance speaker recognition by capturing long-range dependencies and improving feature representation.

Contribution

The paper proposes a general global time-frequency context modeling block that integrates attention and DCT-based methods for more robust speaker verification.

Findings

01. Significant performance improvement over standard ResNet and Squeeze & Excitation models.

02. Effective global context representation enhances speaker verification accuracy.

03. Multi-DCT attention mechanism boosts modeling capacity.

Abstract

Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, and variable-length sequences containing diverse information at each time-frequency (TF) location. The standard convolutional layer that operates on neighboring local regions often fails to capture the complex TF global information. Our motivation is to alleviate these challenges by increasing the modeling capacity, emphasizing significant information, and suppressing possible redundancies. We aim to design a more robust and efficient speaker recognition system by incorporating the benefits of attention mechanisms and Discrete Cosine Transform (DCT) based signal processing techniques, to effectively represent the global information in speech signals. To achieve this, we propose a general global time-frequency context…
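The abstract is cut off before the block itself is described, so the authors' exact design is not shown here. As a rough illustration of how DCT-based pooling can generalize the global average pooling used in Squeeze & Excitation, the sketch below projects each channel of a time-frequency feature map onto a 2D cosine basis and uses the result to gate the channels. Everything in it is an assumption made for illustration: the function names (`dct_basis`, `dct_pool`, `multi_dct_attention`), the channel-grouping scheme, and the two-layer gate with weights `w1` and `w2` are hypothetical and are not taken from the paper.

```python
import numpy as np

def dct_basis(length, freq):
    """DCT-II cosine basis of index `freq` sampled at `length` points."""
    n = np.arange(length)
    return np.cos(np.pi * freq * (n + 0.5) / length)

def dct_pool(x, u, v):
    """Pool a (channels, freq, time) map with one 2D DCT basis.

    Returns one scalar per channel. With u = v = 0 the basis is all
    ones, so this reduces to global average pooling.
    """
    _, n_freq, n_time = x.shape
    basis = np.outer(dct_basis(n_freq, u), dct_basis(n_time, v))
    return (x * basis).mean(axis=(1, 2))

def multi_dct_attention(x, freqs, w1, w2):
    """Hypothetical multi-frequency channel attention (illustrative only).

    Channels are split into len(freqs) groups, each pooled with its own
    (u, v) DCT basis; a small bottleneck then produces per-channel gates.
    """
    n_channels = x.shape[0]
    groups = np.array_split(np.arange(n_channels), len(freqs))
    desc = np.empty(n_channels)
    for idx, (u, v) in zip(groups, freqs):
        desc[idx] = dct_pool(x[idx], u, v)
    hidden = np.maximum(w1 @ desc, 0.0)              # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate
    return x * weights[:, None, None], weights
```

Calling `dct_pool(x, 0, 0)` returns the same values as `x.mean(axis=(1, 2))`, which is why plain global average pooling can be read as the lowest-frequency special case; passing several `(u, v)` pairs to `multi_dct_attention` lets different channel groups summarize different frequency components of the feature map.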

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Batch Normalization · 1x1 Convolution · Residual Connection · Residual Block · Bottleneck Residual Block · Average Pooling · Global Average Pooling · Max Pooling · Kaiming Initialization