Memory-Efficient Training for Text-Dependent SV with Independent Pre-trained Models

Seyed Ali Farokh; Hossein Zeinali

arXiv:2411.10828 · eess.AS · January 13, 2026

TL;DR

This paper presents a memory-efficient approach to text-dependent speaker verification that uses two pre-trained models independently, achieving competitive results without the computational cost of joint fine-tuning on unsegmented inputs.

Contribution

The paper proposes a text-dependent speaker verification system that applies two pre-trained models independently, with targeted domain adaptation, instead of jointly fine-tuning large models.

Findings

01. Achieved a MinDCF of 0.0358 on the challenge evaluation set.

02. Secured first place in the TdSV 2024 challenge.

03. Demonstrated competitive performance while avoiding the computational cost of joint fine-tuning on unsegmented inputs.
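
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how a minimum normalized detection cost (MinDCF) is commonly computed from trial scores. The function name `min_dcf` and the default operating point (`p_target=0.01`, `c_miss=10`, `c_fa=1`) are assumptions made for illustration. This page does not state the cost parameters used by the challenge, and the sketch is not the organizers' scoring tool.

```python
def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds.

    A trial is accepted when its score is strictly above the threshold.
    The default cost parameters are illustrative assumptions.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    n_tar = len(target_scores)
    n_non = len(nontarget_scores)

    # Pool all trials and sort by score (ascending); True marks a target trial.
    trials = sorted(
        [(s, True) for s in target_scores] + [(s, False) for s in nontarget_scores]
    )

    def cost(misses, false_accepts):
        p_miss = misses / n_tar
        p_fa = false_accepts / n_non
        return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

    # Threshold below every score: everything is accepted.
    misses, false_accepts = 0, n_non
    best = cost(misses, false_accepts)

    # Sweep the threshold upward, moving past one distinct score at a time.
    i = 0
    while i < len(trials):
        score = trials[i][0]
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1]:
                misses += 1          # a target trial is now rejected
            else:
                false_accepts -= 1   # a non-target trial is now rejected
            i += 1
        best = min(best, cost(misses, false_accepts))

    # Normalize by the cost of the best trivial system (accept all or reject all).
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

Under this normalization, a value of 0 means the target and non-target scores are perfectly separable, and a value of 1 means the system does no better than always accepting or always rejecting.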

Abstract

This paper presents our submission to the Iranian division of the Text-Dependent Speaker Verification Challenge (TdSV) 2024. Conventional TdSV approaches typically jointly model speaker and linguistic features, requiring unsegmented inputs during training and incurring high computational costs. Additionally, these methods often fine-tune large-scale pre-trained speaker embedding models on the target domain dataset, which may compromise the pre-trained models' original ability to capture speaker-specific characteristics. To overcome these limitations, we employ a TdSV system that utilizes two pre-trained models independently and demonstrate that, by leveraging pre-trained models with targeted domain adaptation, competitive results can be achieved while avoiding the substantial computational costs associated with joint fine-tuning on unsegmented inputs in conventional approaches. Our best…
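
The truncated abstract does not say which two pre-trained models are used or how their outputs are combined. As a rough illustration only, the sketch below assumes one model produces speaker embeddings and a second, separate model predicts which phrase was spoken; a trial is scored by speaker similarity only when the predicted phrase matches the expected one. The names `cosine_similarity`, `tdsv_score`, and `reject_score` are hypothetical, and this gating rule is an assumed design, not the authors' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def tdsv_score(enroll_emb, test_emb, expected_phrase, predicted_phrase,
               reject_score=-1.0):
    """Combine two independent model outputs into one trial score.

    enroll_emb / test_emb: speaker embeddings (assumed to come from a
        pre-trained speaker model).
    expected_phrase / predicted_phrase: phrase labels (assumed to come from
        the trial definition and a separate phrase model, respectively).
    """
    # Wrong phrase: reject regardless of speaker similarity.
    if predicted_phrase != expected_phrase:
        return reject_score
    # Correct phrase: fall back to the speaker similarity score.
    return cosine_similarity(enroll_emb, test_emb)
```

Because the two models never share gradients in such a setup, each can be adapted to the target domain separately, which is consistent with the memory savings the abstract describes.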

Taxonomy

Topics: Speech Recognition and Synthesis