The SJTU System for Short-duration Speaker Verification Challenge 2021
Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian

TL;DR
This paper describes the SJTU system for the Short-duration Speaker Verification (SdSV) Challenge 2021, which combines robust speaker embeddings, language-dependent adaptive score normalization, and in-domain fine-tuning strategies to improve cross-lingual and text-dependent verification.
Contribution
The paper introduces phrase-aware fine-tuning strategies and a phrase-aware neural PLDA for the text-dependent task, along with language-dependent adaptive s-norm for the text-independent task, to improve speaker verification under same-phrase and cross-lingual conditions.
Findings
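Fusion system scored 0.0473 on the challenge's primary evaluation metric in Task 1 (rank 3)
Fusion system scored 0.0581 on the challenge's primary evaluation metric in Task 2 (rank 8)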
Demonstrated effectiveness of phrase-aware strategies
Abstract
This paper presents the SJTU system for both the text-dependent and text-independent tasks of the Short-duration Speaker Verification (SdSV) Challenge 2021. We explored several strong embedding extractors to obtain robust speaker embeddings. For the text-independent task, we explored language-dependent adaptive s-norm to improve system performance under the cross-lingual verification condition. For the text-dependent task, we mainly focused on in-domain fine-tuning strategies based on a model pre-trained on large-scale out-of-domain data. To better distinguish different speakers uttering the same phrase, we proposed several novel phrase-aware fine-tuning strategies and a phrase-aware neural PLDA, which further improved system performance. Finally, we fused the scores of different systems, and our fusion systems achieved 0.0473 in…
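The language-dependent adaptive s-norm mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cosine scoring, a top-k cohort selection per utterance, and a hypothetical `cohorts` dictionary that maps a language label to cohort embeddings, so that each side of a trial is normalized against a cohort of its own language. The function names and the value of `k` are illustrative only.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k_stats(emb, cohort, k):
    """Mean and std of the k highest scores between emb and the cohort."""
    scores = sorted((cosine(emb, c) for c in cohort), reverse=True)[:k]
    # Floor the std to avoid division by zero for degenerate cohorts.
    return mean(scores), max(pstdev(scores), 1e-8)


def adaptive_snorm(enroll, test, enroll_cohort, test_cohort, k=2):
    """Symmetric score normalization using each side's top-k cohort scores."""
    raw = cosine(enroll, test)
    mu_e, sd_e = top_k_stats(enroll, enroll_cohort, k)
    mu_t, sd_t = top_k_stats(test, test_cohort, k)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


def language_dependent_asnorm(enroll, test, enroll_lang, test_lang, cohorts, k=2):
    """Assumption: pick the cohort matching the language of each utterance."""
    return adaptive_snorm(
        enroll, test, cohorts[enroll_lang], cohorts[test_lang], k
    )


# Toy usage with 2-dimensional "embeddings" and made-up language labels.
cohorts = {
    "lang_a": [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
    "lang_b": [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
}
score = language_dependent_asnorm(
    [1.0, 0.0], [1.0, 0.0], "lang_a", "lang_b", cohorts, k=2
)
```

In a cross-lingual trial the enrollment and test utterances would draw their statistics from different cohorts, which is the motivation for keying the cohort on language rather than using one pooled set.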
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
