Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification

Jenthe Thienpondt; Brecht Desplanques; Kris Demuynck

arXiv:2104.02370·eess.AS·September 10, 2021·Interspeech

TL;DR

This paper improves speaker verification by adding frequency translational invariance to TDNNs and frequency positional information to 2D ResNets, leading to significant performance improvements on the short-duration, cross-lingual SdSVC-21 task.

Contribution

The paper introduces a hybrid CNN-TDNN architecture that adds a 2D convolutional stem to an ECAPA-TDNN baseline, and it incorporates frequency positional encodings and a frequency-wise squeeze-and-excitation (SE) module into an SE-ResNet34.
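To make the frequency-wise SE idea concrete, here is a minimal pure-Python sketch. It assumes the module squeezes a feature map of shape [channels][frequency][time] into one descriptor per frequency bin, passes it through a small bottleneck, and rescales each frequency bin by the resulting gate. The function name, argument names, and layer shapes are hypothetical and are not taken from the authors' code.

```python
import math


def frequency_wise_se(x, w1, b1, w2, b2):
    """Rescale each frequency bin of x by a learned gate (illustrative only).

    x  : nested list shaped [C][F][T] (channels, frequency bins, frames)
    w1 : [H][F] weights of the bottleneck layer, b1: [H] biases
    w2 : [F][H] weights of the output layer,     b2: [F] biases
    """
    C, F, T = len(x), len(x[0]), len(x[0][0])
    # Squeeze: average over channels and time, one value per frequency bin.
    z = [sum(x[c][f][t] for c in range(C) for t in range(T)) / (C * T)
         for f in range(F)]
    # Excitation: linear + ReLU bottleneck, then linear + sigmoid gate.
    h = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(w1, b1)]
    s = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, h)) + b)))
         for row, b in zip(w2, b2)]
    # Scale: the same gate s[f] is applied to bin f in every channel and frame.
    return [[[x[c][f][t] * s[f] for t in range(T)] for f in range(F)]
            for c in range(C)]


# Toy input: 1 channel, 2 frequency bins, 2 frames, bottleneck size 1.
x = [[[1.0, 2.0], [3.0, 4.0]]]
y = frequency_wise_se(x, w1=[[0.0, 0.0]], b1=[0.0], w2=[[0.0], [0.0]], b2=[0.0, 0.0])
```

With all weights and biases at zero, every gate is sigmoid(0) = 0.5, so the toy call simply halves the input. The difference from a standard channel-wise SE block is only the axis being gated: frequency bins instead of channels.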

Findings

01

Significant performance gains over baseline models on SdSVC-21 data.

02

Achieved third place in the SdSVC-21 Task 2 ranking with a four-system fusion.

03

The improved architectures outperform the original models on the VoxCeleb1 test set.

Abstract

This paper describes the IDLab submission for the text-independent task of the Short-duration Speaker Verification Challenge 2021 (SdSVC-21). This speaker verification competition focuses on short duration test recordings and cross-lingual trials, along with the constraint of limited availability of in-domain DeepMine Farsi training data. Currently, both Time Delay Neural Networks (TDNNs) and ResNets achieve state-of-the-art results in speaker verification. These architectures are structurally very different and the construction of hybrid networks looks a promising way forward. We introduce a 2D convolutional stem in a strong ECAPA-TDNN baseline to transfer some of the strong characteristics of a ResNet based model to this hybrid CNN-TDNN architecture. Similarly, we incorporate absolute frequency positional encodings in an SE-ResNet34 architecture. These learnable feature map biases…
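The abstract describes the frequency positional encodings as learnable feature map biases. One way to picture this is sketched below, assuming a single bias value per frequency bin that is shared across all channels and frames. That sharing scheme, the function name, and the argument names are assumptions made for illustration and are not confirmed by the text above.

```python
def add_frequency_positional_encoding(x, pos_bias):
    """Add a per-frequency bias to a feature map (illustrative only).

    x        : nested list shaped [C][F][T] (channels, frequency bins, frames)
    pos_bias : list of F values; learnable in a real model, fixed here
    """
    # pos_bias[f] is broadcast over every channel and every frame of bin f.
    return [[[v + pos_bias[f] for v in row] for f, row in enumerate(channel)]
            for channel in x]


# Toy input: 2 channels, 2 frequency bins, 2 frames.
x = [[[0.0, 0.0], [0.0, 0.0]],
     [[1.0, 1.0], [1.0, 1.0]]]
y = add_frequency_positional_encoding(x, pos_bias=[0.1, -0.2])
```

Because the bias depends only on the frequency index, convolutional layers that follow can tell which absolute frequency region an activation came from, which a purely translation-invariant 2D convolution could not do on its own.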

Taxonomy

Methods: Residual Connection · 1x1 Convolution · Residual Block · Batch Normalization · Bottleneck Residual Block · Max Pooling · Convolution · Kaiming Initialization · Average Pooling