BUT System Description to VoxCeleb Speaker Recognition Challenge 2019

Hossein Zeinali; Shuai Wang; Anna Silnova; Pavel Matějka; Oldřich Plchot

arXiv:1910.12592·eess.AS·October 29, 2019·79 cites

TL;DR

This paper details the BUT team's submission to the VoxCeleb Speaker Recognition Challenge 2019, describing system architectures, training strategies, and performance results on VoxCeleb-1 test sets.

Contribution

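It describes a fusion of four CNN-based speaker recognition systems: two ResNet34 networks built on two-dimensional convolutions and two one-dimensional networks based on the x-vector topology. Some networks are fine-tuned with additive margin angular softmax, and Kaldi FBank and PLP features are used as input.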

Findings

01 Best Fixed condition EER: 1.42% (challenge evaluation set)

02 Best Open condition EER: 1.26% (challenge evaluation set)

03 Fusion of multiple CNN topologies improves accuracy
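The error figures above are equal error rates (EER): the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a number can be estimated from trial scores. The function name, the threshold sweep, and the toy scores are illustrative assumptions and are not taken from the report or from the challenge scoring tools.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (in [0, 1]) by sweeping thresholds over observed scores.

    Hypothetical helper for illustration; official scoring tools may
    interpolate between operating points differently.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted when its score is >= the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one target trial scores below two non-target trials.
same_speaker = [0.9, 0.8, 0.7, 0.3]
different_speaker = [0.6, 0.4, 0.2, 0.1]
print(f"EER = {100 * equal_error_rate(same_speaker, different_speaker):.2f}%")
```

With real evaluation lists the sweep covers every distinct score, so the cost grows quadratically with the number of trials; sorting once and accumulating counts is the usual optimization.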

Abstract

In this report, we describe the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on the VoxCeleb-1 test sets. The submitted systems for both the Fixed and Open conditions are fusions of four Convolutional Neural Network (CNN) topologies. The first and second networks have a ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNNs based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The Fixed and Open systems differ in the training data used and in the fusion strategy. The best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
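The abstract mentions fine-tuning with additive margin angular softmax without giving details. As a rough illustration, the sketch below computes the commonly used additive-margin formulation for a single example, in which the target-class cosine is reduced by a margin before scaling. The function name, the scale of 30, and the margin of 0.2 are assumptions chosen for illustration; the exact variant and hyperparameters used in the report may differ.

```python
import math


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.2):
    """Additive-margin softmax loss for one example (illustrative sketch).

    embedding:     list of floats, the speaker embedding
    class_weights: list of weight vectors, one per training speaker
    target:        index of the true speaker
    scale, margin: assumed hyperparameters, not values from the report
    """
    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    e = unit(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [sum(a * b for a, b in zip(e, unit(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [scale * (c - margin if k == target else c)
              for k, c in enumerate(cosines)]
    # Numerically stable log-sum-exp, then cross-entropy for the target class.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


# Toy check: the margin makes the same example harder, so the loss increases.
weights = [[1.0, 0.0], [0.0, 1.0]]
print(am_softmax_loss([0.8, 0.6], weights, target=0, margin=0.0))
print(am_softmax_loss([0.8, 0.6], weights, target=0, margin=0.2))
```

The margin forces the network to separate speakers by more than plain softmax requires, which is why it is often applied as a fine-tuning stage after a model has already converged with a standard loss.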

Code & Models

Models

🤗
Wespeaker/wespeaker-voxceleb-resnet34-LM
model· 75 dl· ♡ 8

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Test