Beijing ZKJ-NPU Speaker Verification System for VoxCeleb Speaker Recognition Challenge 2021

Li Zhang; Huan Zhao; Qinling Meng; Yanli Chen; Min Liu; Lei Xie

arXiv:2109.03568·cs.SD·November 19, 2021·5 cites

Open Access

TL;DR

This paper presents the Beijing ZKJ-NPU system for the VoxCeleb Speaker Recognition Challenge 2021, which combines several neural network architectures with embedding and score normalization and placed second in both tracks.

Contribution

Introduces the ResNet-DTCF, CoAtNet, and PyConv networks to improve CNN-based speaker embedding models for speaker verification.

Findings

01

Achieved minDCF/EER of 0.1205/2.8160% on track 1 and 0.1175/2.8400% on track 2, measured on the evaluation trials.

02

Fused 11 systems for track 1 and 14 systems for track 2 to reach these results.

03

Secured second place in both challenge tracks.
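The findings above are reported as minDCF/EER. As a rough illustration of the second metric, the equal error rate (EER) is the operating point where the false-acceptance rate equals the false-rejection rate. The sketch below is not the challenge's official scoring tool; the function name `equal_error_rate` and the toy trial list are assumptions made for illustration, and tied scores are not given any special handling.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for verification trials.

    scores: similarity per trial (higher = more likely same speaker).
    labels: truthy for target (same-speaker) trials, falsy otherwise.
    """
    # Sort trials from highest to lowest score.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_target = sum(1 for _, y in pairs if y)
    n_nontarget = len(pairs) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), None
    accepted_targets = accepted_nontargets = 0
    # k is the number of top-scoring trials accepted at a given threshold.
    for k in range(len(pairs) + 1):
        if k > 0:
            if pairs[k - 1][1]:
                accepted_targets += 1
            else:
                accepted_nontargets += 1
        far = accepted_nontargets / n_nontarget   # false-acceptance rate
        frr = 1 - accepted_targets / n_target     # false-rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy trials, not data from the challenge.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

On real trial lists the two error curves rarely cross exactly at a sampled threshold, so scoring tools typically interpolate; this sketch simply averages the two rates at the closest point.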

Abstract

In this report, we describe the Beijing ZKJ-NPU team submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We participated in the fully supervised speaker verification track 1 and track 2. In the challenge, we explored various kinds of advanced neural network structures with different pooling layers and objective loss functions. In addition, we introduced the ResNet-DTCF, CoAtNet and PyConv networks to advance the performance of CNN-based speaker embedding models. Moreover, we applied embedding normalization and score normalization at the evaluation stage. By fusing 11 and 14 systems, our final best performances (minDCF/EER) on the evaluation trials are 0.1205/2.8160% and 0.1175/2.8400% for track 1 and track 2, respectively. With our submission, we took second place in the challenge for both tracks.
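The abstract mentions embedding normalization and score normalization at the evaluation stage without naming the variants. As a minimal sketch under assumptions, the code below length-normalizes embeddings, scores a trial with cosine similarity, and applies symmetric score normalization (S-norm) against a cohort. S-norm is swapped in here as one common choice, not as the authors' confirmed method, and the names `length_normalize`, `cosine_score`, `s_norm`, and the toy vectors are all hypothetical.

```python
import math
from statistics import mean, pstdev


def length_normalize(vec):
    """Scale an embedding to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_score(a, b):
    """Cosine similarity, computed as a dot product of unit vectors."""
    a, b = length_normalize(a), length_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def s_norm(enroll, test, cohort):
    """Symmetric score normalization (assumed variant, for illustration).

    The raw score is standardized twice: once with the statistics of the
    enrollment embedding scored against the cohort, once with those of
    the test embedding, and the two results are averaged.
    The cohort scores must not all be identical (zero deviation).
    """
    raw = cosine_score(enroll, test)
    enroll_cohort = [cosine_score(enroll, c) for c in cohort]
    test_cohort = [cosine_score(test, c) for c in cohort]
    z = (raw - mean(enroll_cohort)) / pstdev(enroll_cohort)
    t = (raw - mean(test_cohort)) / pstdev(test_cohort)
    return 0.5 * (z + t)


# Hypothetical 2-dimensional embeddings; real systems use far larger ones.
cohort = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(s_norm([2.0, 1.0], [1.0, 2.0], cohort))
```

Normalizing against a cohort makes scores from different trials more comparable, which helps when a single decision threshold is applied across all trials or when several systems are fused.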

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing