ShaneRun System Description to VoxCeleb Speaker Recognition Challenge 2020
Shen Chen

TL;DR
This report describes the ShaneRun team's speaker recognition system for VoxSRC 2020. It extracts speaker embeddings with a ResNet-34 encoder and fuses scores using a t-SNE normalized distance, improving on the challenge baseline.
Contribution
A simple t-SNE based fusion method, applied on top of ResNet-34 speaker embeddings extracted with the open-source voxceleb-trainer, for the VoxCeleb challenge.
Findings
Achieved 0.3098 minDCF on the Fixed data track, outperforming the baseline by 1.3%.
Achieved 5.076% EER on the Fixed data track, outperforming the baseline by 2.2%.
Demonstrated effectiveness of t-SNE normalized distance fusion.
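The two metrics reported above can be computed from a list of trial scores and same-speaker labels. The following is a minimal Python sketch, not the challenge's official scoring tool: the function names are hypothetical, the EER is approximated at the operating point where the miss and false-alarm rates are closest (no interpolation), and the detection-cost parameters (p_target = 0.05, c_miss = c_fa = 1) are assumed defaults rather than values stated in this report.

```python
def error_rates(scores, labels):
    """Return (miss_rate, false_alarm_rate) pairs, one per distinct threshold.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for a same-speaker (target) trial, 0 otherwise.
    """
    n_tar = sum(1 for y in labels if y == 1)
    n_non = len(labels) - n_tar
    if n_tar == 0 or n_non == 0:
        raise ValueError("need at least one target and one non-target trial")

    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points = [(1.0, 0.0)]  # threshold above every score: all trials rejected
    acc_tar = acc_non = 0
    for k, (score, label) in enumerate(trials):
        if label == 1:
            acc_tar += 1
        else:
            acc_non += 1
        # Emit a point only after the last trial in a group of tied scores.
        if k + 1 == len(trials) or trials[k + 1][0] != score:
            points.append((1.0 - acc_tar / n_tar, acc_non / n_non))
    return points


def compute_eer(points):
    """Approximate the equal error rate (EER) without interpolation."""
    miss, fa = min(points, key=lambda p: abs(p[0] - p[1]))
    return (miss + fa) / 2.0


def compute_min_dcf(points, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds (assumed parameters)."""
    raw = min(c_miss * p_target * miss + c_fa * (1.0 - p_target) * fa
              for miss, fa in points)
    return raw / min(c_miss * p_target, c_fa * (1.0 - p_target))


if __name__ == "__main__":
    # Toy trial list, invented for illustration only.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
    pts = error_rates(toy_scores, toy_labels)
    print("EER:", compute_eer(pts), "minDCF:", compute_min_dcf(pts))
```

Both metrics are derived from the same list of operating points, so the threshold sweep is done once and reused. Lower is better for each.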
Abstract
In this report, we describe the ShaneRun team's submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We use ResNet-34 as the encoder to extract speaker embeddings, taken from the open-source voxceleb-trainer. We also provide a simple method for optimum fusion that uses the t-SNE normalized distance of test utterance pairs instead of the original negative Euclidean distance from the encoder. The final submitted system achieved 0.3098 minDCF and 5.076% EER on the Fixed data track, outperforming the baseline by 1.3% minDCF and 2.2% EER respectively.
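As a point of reference for the scoring described in the abstract, the sketch below computes a negative Euclidean distance score between two embeddings. It is an illustration under assumptions, not the submitted system: the function names are hypothetical, the embeddings are assumed to be L2-normalized before comparison, and the t-SNE normalization step is not reproduced because the abstract does not specify it.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def neg_euclidean_score(emb_a, emb_b):
    """Negative Euclidean distance between two unit-normalized embeddings.

    Higher (closer to 0) means the two utterances are more similar.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, invented for illustration only.
    enrol = [0.9, 0.1, 0.2]
    same_speaker = [0.8, 0.2, 0.1]
    other_speaker = [0.1, 0.9, 0.3]
    print(neg_euclidean_score(enrol, same_speaker))
    print(neg_euclidean_score(enrol, other_speaker))
```

With unit-length embeddings the score lies between -2 and 0, so a single decision threshold can be applied across all trial pairs.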
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
