Memory-Efficient Training for Text-Dependent SV with Independent Pre-trained Models
Seyed Ali Farokh, Hossein Zeinali

TL;DR
This paper introduces a memory-efficient approach to text-dependent speaker verification (TdSV) that uses two pre-trained models independently, achieving competitive results without the high computational cost of joint fine-tuning.
Contribution
The paper proposes a method that uses two separate pre-trained models with targeted domain adaptation for speaker verification, avoiding joint fine-tuning of large models on unsegmented inputs.
Findings
Achieved a MinDCF of 0.0358 on the challenge evaluation set.
Secured first place in the TdSV 2024 challenge.
Demonstrated competitive performance with reduced computational resources.
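For readers unfamiliar with the metric, the following is a minimal sketch of how a normalized minimum detection cost (MinDCF) is typically computed from trial scores. The cost parameters used here (p_target=0.01, c_miss=10, c_fa=1) are assumptions chosen for illustration and are not taken from this summary; the function name min_dcf is likewise hypothetical, and the challenge's official scoring tool may differ.

```python
def min_dcf(target_scores, nontarget_scores,
            p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Normalized minimum detection cost over all decision thresholds.

    The default cost parameters are illustrative assumptions, not the
    challenge's confirmed settings.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    # Label each trial (1 = target, 0 = non-target), sort by ascending score.
    trials = sorted(
        [(s, 1) for s in target_scores] + [(s, 0) for s in nontarget_scores]
    )
    n_tar, n_non = len(target_scores), len(nontarget_scores)

    # Cost of the better trivial system (accept all or reject all),
    # used to normalize the result.
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))

    # Threshold below every score: everything is accepted,
    # so P_miss = 0 and P_fa = 1.
    misses, false_accepts = 0, n_non
    best = c_fa * (1.0 - p_target)

    i = 0
    while i < len(trials):
        # Raise the threshold just above the current score;
        # trials with tied scores are rejected together.
        score = trials[i][0]
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1] == 1:
                misses += 1
            else:
                false_accepts -= 1
            i += 1
        p_miss = misses / n_tar
        p_fa = false_accepts / n_non
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, cost)

    return best / default_cost
```

Lower is better: a system whose target and non-target scores are perfectly separable scores 0, while a system no better than always rejecting scores 1 under these parameters.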
Abstract
This paper presents our submission to the Iranian division of the Text-Dependent Speaker Verification Challenge (TdSV) 2024. Conventional TdSV approaches typically jointly model speaker and linguistic features, requiring unsegmented inputs during training and incurring high computational costs. Additionally, these methods often fine-tune large-scale pre-trained speaker embedding models on the target domain dataset, which may compromise the pre-trained models' original ability to capture speaker-specific characteristics. To overcome these limitations, we employ a TdSV system that utilizes two pre-trained models independently and demonstrate that, by leveraging pre-trained models with targeted domain adaptation, competitive results can be achieved while avoiding the substantial computational costs associated with joint fine-tuning on unsegmented inputs in conventional approaches. Our best…
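The truncated abstract does not say how the outputs of the two independent models are combined. As a generic illustration only, and not the authors' method, a text-dependent trial can be accepted when a phrase check passes and a speaker similarity score clears a threshold. In the sketch below the embeddings, the phrase IDs, the 0.5 threshold, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify_trial(enroll_emb, test_emb,
                 expected_phrase_id, predicted_phrase_id,
                 speaker_threshold=0.5):
    """Accept only if both the phrase and the speaker match.

    enroll_emb / test_emb: speaker embeddings assumed to come from a
        pre-trained speaker model (hypothetical inputs).
    predicted_phrase_id: output of a separate phrase model
        (hypothetical input).
    speaker_threshold: illustrative value, not from the source.
    """
    phrase_ok = predicted_phrase_id == expected_phrase_id
    speaker_score = cosine_similarity(enroll_emb, test_emb)
    accepted = phrase_ok and speaker_score >= speaker_threshold
    return accepted, speaker_score


# Usage: same speaker direction, correct phrase -> accepted.
accepted, score = verify_trial([1.0, 0.0], [1.0, 0.0], 3, 3)
```

Because the two checks never share parameters in this sketch, each model can be adapted or replaced separately, which is the general appeal of keeping the speaker and phrase components independent.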
Taxonomy
Topics: Speech Recognition and Synthesis
