STC speaker recognition systems for the NIST SRE 2021
Anastasia Avdeeva, Aleksei Gusev, Igor Korsunov, Alexander Kozlov, Galina Lavrentyeva, Sergey Novoselov, Timur Pekhovsky, Andrey Shulipa, Alisa Vinogradova, Vladimir Volokhov, Evgeny Smirnov, Vasily Galyuk

TL;DR
This paper describes STC Ltd.'s speaker recognition systems for NIST SRE 2021, which combine deep neural network embedding extractors, wav2vec 2.0 features, and multimodal fusion, and reports state-of-the-art results in the fixed and open training conditions.
Contribution
It applies wav2vec 2.0 features to speaker recognition alongside deep speaker embedding architectures such as ResNets and ECAPA, and combines the resulting subsystems with fusion and calibration techniques.
Findings
Fine-tuning a pretrained large wav2vec 2.0 model yields the best performance in the open condition.
Unsupervised pretraining with Contrastive Predictive Coding enhances transformer-based extractors.
Multimodal fusion improves overall speaker recognition accuracy.
Abstract
This paper describes the STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation for both the fixed and open training conditions. These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors. During the NIST 2021 SRE challenge we focused on training state-of-the-art deep speaker embedding extractors, such as ResNets and ECAPA networks, with additive angular margin based loss functions. Additionally, inspired by the recent success of wav2vec 2.0 features in automatic speech recognition, we explored the effectiveness of this approach in the speaker verification field. According to our observations, fine-tuning the pretrained large wav2vec 2.0 model provides our best performing systems for the open track condition. Our experiments with wav2vec 2.0 based extractors for the fixed condition showed…
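The abstract mentions additive angular margin based loss functions without spelling one out. The following is a minimal NumPy sketch of a generic additive angular margin softmax loss, included only to illustrate the idea; it is not the authors' implementation. The function name `aam_softmax_loss` and the `margin` and `scale` values are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def aam_softmax_loss(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Generic additive angular margin softmax loss (illustrative sketch).

    embeddings: (batch, dim) speaker embeddings
    weights:    (num_classes, dim) one prototype vector per training speaker
    labels:     (batch,) integer speaker indices
    margin, scale: hypothetical hyperparameter values, not taken from the paper
    """
    # L2-normalise both sides so the dot product equals the cosine of the angle.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)           # (batch, num_classes)
    theta = np.arccos(cos)

    # Add the margin to the target-class angle only; cap at pi so that
    # cos(theta + margin) stays monotonically decreasing in this sketch.
    rows = np.arange(len(labels))
    theta[rows, labels] = np.minimum(theta[rows, labels] + margin, np.pi)

    # Scaled cosine logits followed by a numerically stable log-softmax.
    logits = scale * np.cos(theta)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Mean cross-entropy over the batch.
    return -log_probs[rows, labels].mean()
```

With `margin=0.0` this reduces to a softmax over scaled cosine similarities; a positive margin lowers the target-class logit, so an embedding has to sit closer in angle to its speaker prototype to reach the same loss.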
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Average Pooling · Global Average Pooling · Batch Normalization · Residual Connection · InfoNCE · 1x1 Convolution · Bottleneck Residual Block · Convolution · Residual Block
