Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020
Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

TL;DR
This paper presents a text-dependent speaker verification system for the SdSV Challenge 2020 that preserves character-level information through new pooling and score compensation methods built on a CTC-based ASR model, reporting 0.0785% MinDCF and 2.23% EER.
Contribution
It introduces new pooling and score compensation techniques leveraging CTC-based ASR to enhance phrase-dependent information in speaker verification embeddings.
Findings
Best reported performance is 0.0785% MinDCF and 2.23% EER.
Fusing multiple systems yields the best results.
Proposed methods outperform conventional pooling techniques.
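For readers unfamiliar with the EER figure quoted above, the following is a minimal sketch of how an equal error rate can be estimated from trial scores. It is not the challenge's official scoring tool; the function name `equal_error_rate`, the toy scores, and the simple threshold sweep are assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Illustrative only: real scoring tools interpolate along the DET curve.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores, not data from the paper.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

In a text-dependent task such as this one, a non-target trial may involve the wrong speaker, the wrong phrase, or both, so the same computation covers all of these rejection cases.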
Abstract
This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods (e.g., statistical, self-attentive, ghostVLAD pooling). Although the conventional pooling methods provide embeddings with a sufficient amount of speaker-dependent information, our experiments show that these embeddings often lack phrase-dependent information. To mitigate this problem, we propose new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model to take the lexical content into account. Both methods showed improvement over the…
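As a point of reference for the conventional pooling mentioned in the abstract, here is a minimal sketch of statistics pooling (per-dimension mean and standard deviation over frames). It illustrates the baseline idea only, not the proposed CTC-based pooling; the function name `statistics_pooling` and the toy frame values are assumptions made for illustration.

```python
import math


def statistics_pooling(frames):
    """Aggregate T frame-level vectors of dimension D into one 2*D vector.

    The output is the per-dimension mean followed by the per-dimension
    (population) standard deviation, computed across all frames.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Hypothetical toy input: 3 frames, 2 feature dimensions.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
embedding = statistics_pooling(frames)

# The same frames in reverse temporal order give the same statistics.
reversed_embedding = statistics_pooling(frames[::-1])
```

Because the mean and standard deviation do not depend on frame order, this pooling step by itself does not encode the sequence of the utterance. That is one plausible intuition, offered here as an assumption rather than the authors' stated analysis, for why such embeddings may carry little phrase-dependent information.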
