Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report
Amir Mohammad Rostami, Pourya Jafarzadeh

TL;DR
This paper reports a system for the 2024 TdSV Challenge that combines adapted pretrained neural networks, data augmentation, and ensemble learning, reaching a MinDCF of 0.0461 and an EER of 1.3% in speaker and phrase verification.
Contribution
The paper introduces a multi-model ensemble that pairs two networks pretrained on VoxCeleb and adapted to the task (ResNet-TDNN and NeXt-TDNN) with a lightweight model, EfficientNet-A0, trained specifically on the challenge data.
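One common way to build such an ensemble is score-level fusion: each model scores every trial, the scores are put on a shared scale, and a weighted average gives the final score. The sketch below illustrates that idea only. This summary does not say how the authors fused their systems, so the z-normalisation, the equal default weights, and the names `z_normalise` and `fuse_scores` are all assumptions, and the scores in the usage example are toy values, not results from the paper.

```python
from statistics import mean, pstdev


def z_normalise(scores):
    """Shift and scale one system's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A constant-score system carries no information about the trials.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(system_scores, weights=None):
    """Fuse per-trial scores from several systems by weighted averaging.

    system_scores: dict mapping a system name to a list of scores,
                   one per trial, with all lists in the same trial order.
    weights:       optional dict mapping a system name to its weight
                   (assumption: equal weights when omitted).
    """
    names = list(system_scores)
    if not names:
        raise ValueError("at least one system is required")
    n_trials = len(system_scores[names[0]])
    if any(len(system_scores[name]) != n_trials for name in names):
        raise ValueError("all systems must score the same trials")
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}

    normalised = {name: z_normalise(system_scores[name]) for name in names}
    return [
        sum(weights[name] * normalised[name][i] for name in names)
        for i in range(n_trials)
    ]


if __name__ == "__main__":
    # Toy scores for three trials; illustrative values, not from the paper.
    toy = {
        "resnet_tdnn": [0.2, 0.5, 0.9],
        "next_tdnn": [1.0, 3.0, 7.0],
        "efficientnet_a0": [-4.0, -1.0, 2.0],
    }
    print(fuse_scores(toy))
```

Normalising before averaging keeps a system with a wide score range from dominating the fused score. In practice the weights would typically be tuned on a development set rather than left equal.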
Findings
The system achieved a MinDCF of 0.0461 and an EER of 1.3% on the challenge.
Combining adapted neural architectures with extensive data augmentation proved effective.
Ensemble learning improved speaker and phrase verification accuracy.
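For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how EER and normalised MinDCF can be computed from lists of target and non-target trial scores. The function names are hypothetical, and the default cost parameters (`p_target`, `c_miss`, `c_fa`) are illustrative assumptions, not values stated in this summary; the challenge's evaluation plan defines the actual settings.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, p_miss, p_fa) for every candidate threshold.

    A trial is accepted when its score is >= the threshold.
    This brute-force sweep is quadratic and meant for small examples.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # the "reject everything" operating point
    n_tar, n_non = len(target_scores), len(nontarget_scores)
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / n_tar
        p_fa = sum(s >= t for s in nontarget_scores) / n_non
        points.append((t, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER at the threshold where p_miss and p_fa are closest."""
    points = error_rates(target_scores, nontarget_scores)
    _, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2


def min_dcf(target_scores, nontarget_scores,
            p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Normalised minimum detection cost over all thresholds.

    The default costs are assumptions chosen for illustration.
    """
    points = error_rates(target_scores, nontarget_scores)
    best = min(
        c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        for _, p_miss, p_fa in points
    )
    # Normalise by the cheaper of the two trivial systems
    # (always reject or always accept).
    trivial = min(c_miss * p_target, c_fa * (1 - p_target))
    return best / trivial


if __name__ == "__main__":
    # Toy scores; illustrative values, not from the paper.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
    print(min_dcf(targets, nontargets))
```

EER summarises the operating point where misses and false alarms are equally likely, while MinDCF weights the two error types by their assumed costs and the prior probability of a target trial, so the two numbers can rank systems differently.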
Abstract
This paper presents a system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3%. Our approach focused on adapting existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, originally trained on the VoxCeleb dataset. This strategy was chosen because of the limited challenge duration and the available resources at the time. In addition, we designed a lightweight and resource-efficient model, EfficientNet-A0, trained specifically on the challenge dataset to improve adaptation and strengthen the ensemble approach. Our system combines advanced neural architectures, extensive data augmentation, and optimised hyperparameters. These components helped achieve strong performance in text-dependent speaker verification. The results also demonstrate the…
