Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances
Aleksei Gusev, Vladimir Volokhov, Tseren Andzhukaev, Sergey Novoselov, Galina Lavrentyeva, Marina Volkova, Alice Gazizullina, Andrey Shulipa, Artem Gorlanov, Anastasia Avdeeva, Artem Ivanov, Alexander Kozlov, Timur Pekhovsky, Yuri Matveev

TL;DR
This paper explores deep neural network architectures, specifically TDNN and ResNet, to improve far-field speaker verification accuracy in noisy, reverberant environments, especially for short utterances, demonstrating ResNet's superior performance.
Contribution
It introduces ResNet-based speaker embedding extractors and training methods that outperform traditional x-vector models in challenging conditions for both long and short utterances.
Findings
ResNet architectures outperform x-vector in verification quality.
ResNet maintains high performance with short utterances.
Supporting techniques such as speech activity detection and score normalization further improve results.
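To make the score normalization finding concrete, here is a minimal sketch of symmetric score normalization (S-norm) applied to cosine scores between speaker embeddings. This is one common variant, chosen for illustration; the summary does not say which normalization the paper uses. The function names, the cohort, and the toy 3-dimensional embeddings are all assumptions.

```python
import math
from statistics import mean, pstdev


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def s_norm(enroll, test, cohort):
    """Symmetric score normalization (S-norm) of a raw cosine score.

    The raw score is z-normalized twice: once against the cohort scores
    of the enrollment embedding and once against those of the test
    embedding. The two normalized scores are then averaged.
    """
    raw = cosine_score(enroll, test)
    enroll_cohort = [cosine_score(enroll, c) for c in cohort]
    test_cohort = [cosine_score(test, c) for c in cohort]
    z_enroll = (raw - mean(enroll_cohort)) / pstdev(enroll_cohort)
    z_test = (raw - mean(test_cohort)) / pstdev(test_cohort)
    return 0.5 * (z_enroll + z_test)


# Toy embeddings, made up for illustration only (not data from the paper).
cohort = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.7]]
enroll = [0.9, 0.1, 0.2]
same_speaker = [0.8, 0.2, 0.1]
other_speaker = [0.1, 0.9, 0.3]

target_score = s_norm(enroll, same_speaker, cohort)
impostor_score = s_norm(enroll, other_speaker, cohort)
print(target_score > impostor_score)
```

Normalizing against a cohort puts scores from different enrollment and test conditions on a comparable scale, which is why a single decision threshold tends to hold up better afterwards.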
Abstract
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From the practical point of view, taking into account the increased interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time Delay Neural Network) and ResNet (Residual Neural…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection
