The 2021 NIST Speaker Recognition Evaluation
Seyed Omid Sadjadi, Craig Greenberg, Elliot Singer, Lisa Mason, Douglas Reynolds

TL;DR
The 2021 NIST Speaker Recognition Evaluation introduced new multimodal and multilingual challenges, assessed system performance across audio and visual modalities, and demonstrated the effectiveness of neural network architectures and data augmentation techniques.
Contribution
This paper provides a comprehensive overview of the SRE21 evaluation, highlighting new challenges, data, and the performance of various systems in a large-scale multimodal speaker recognition context.
Findings
Audio-visual fusion improves performance significantly.
Neural network architectures like ResNet enhance recognition accuracy.
Complex training techniques contribute to performance gains.
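To make the audio-visual fusion finding concrete, here is a minimal sketch of score-level fusion, one common way to combine two modalities. It is an illustration under stated assumptions, not the method used by any SRE21 submission: the function names (`z_normalize`, `fuse_scores`), the equal default weight, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Shift and scale a list of scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no ranking information, so map to zeros.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Weighted sum of per-trial audio and visual scores (hypothetical helper).

    Each modality is z-normalized first so that neither dominates the sum
    merely because its raw scores span a wider range.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("both modalities must score the same trials")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    audio_norm = z_normalize(audio_scores)
    visual_norm = z_normalize(visual_scores)
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_norm, visual_norm)
    ]


# Made-up scores for four trials, for illustration only.
audio = [2.1, -0.3, 1.7, -1.2]
visual = [0.9, 0.1, 0.8, 0.2]
fused = fuse_scores(audio, visual)
```

In practice the fusion weight would typically be tuned on a development set, or replaced by a trained calibration and fusion model; the fixed equal weight here is only a placeholder.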
Abstract
The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996. It was the second large-scale multimodal speaker/person recognition evaluation organized by NIST (the first one being SRE19). Similar to SRE19, it featured two core evaluation tracks, namely audio and audio-visual, as well as an optional visual track. In addition to offering fixed and open training conditions, it also introduced new challenges for the community, thanks to a new multimodal (i.e., audio, video, and selfie images) and multilingual (i.e., with multilingual speakers) corpus, termed WeCanTalk, collected outside North America by the Linguistic Data Consortium (LDC). These challenges included: 1) trials (target and non-target) with enrollment and test segments originating from different…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
