Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Aleksei Gusev; Vladimir Volokhov; Tseren Andzhukaev; Sergey Novoselov; Galina Lavrentyeva; Marina Volkova; Alice Gazizullina; Andrey Shulipa; Artem Gorlanov; Anastasia Avdeeva; Artem Ivanov; Alexander Kozlov; Timur Pekhovsky; Yuri Matveev

arXiv:2002.06033 · cs.SD · February 17, 2020 · 5 cites

Open Access

TL;DR

This paper explores deep neural network architectures, specifically TDNN and ResNet, to improve far-field speaker verification accuracy in noisy, reverberant environments, especially for short utterances, demonstrating ResNet's superior performance.

Contribution

It introduces ResNet-based speaker embedding extractors and training methods that outperform traditional x-vector models in challenging conditions for both long and short utterances.

Findings

01 ResNet architectures outperform the x-vector baseline in verification quality.

02 ResNet maintains high performance on short utterances.

03 Techniques such as speech activity detection and score normalization further improve results.
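
The findings name score normalization without saying which variant was used. As an illustration only, the sketch below assumes cosine scoring with symmetric normalization (S-norm) against a cohort of impostor embeddings. The function names, the toy three-dimensional embeddings, and the cohort are all hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def s_norm(enroll, test, cohort):
    """Symmetric score normalization (assumed variant, not confirmed by the source).

    The raw trial score is z-normalized twice: once with the statistics of the
    enrollment embedding scored against the cohort, once with those of the test
    embedding scored against the cohort. The two results are averaged.
    The cohort scores must not all be identical, or the deviation is zero.
    """
    raw = cosine(enroll, test)
    enroll_scores = [cosine(enroll, c) for c in cohort]
    test_scores = [cosine(test, c) for c in cohort]
    z_enroll = (raw - mean(enroll_scores)) / pstdev(enroll_scores)
    z_test = (raw - mean(test_scores)) / pstdev(test_scores)
    return 0.5 * (z_enroll + z_test)


# Toy data, purely illustrative: real embeddings have hundreds of dimensions.
enroll = [1.0, 0.0, 0.2]
test_same = [0.9, 0.1, 0.3]   # close to the enrollment embedding
test_diff = [0.0, 1.0, 0.1]   # far from the enrollment embedding
cohort = [
    [0.2, 0.9, 0.1],
    [0.1, 0.2, 1.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.5, 0.8],
]

score_same = s_norm(enroll, test_same, cohort)
score_diff = s_norm(enroll, test_diff, cohort)
print(score_same > score_diff)
```

Normalizing against a cohort puts scores from different recording conditions on a comparable scale, which is why a single decision threshold tends to transfer better across noisy and clean trials.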

Abstract

Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions, according to the results obtained on early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the increased interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at achieving two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time Delay Neural Network) and ResNet (Residual Neural…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection