A framework of text-dependent speaker verification for Chinese numerical string corpus
Litong Zheng, Feng Hong, Weijie Xu, Wan Zheng

TL;DR
This paper introduces an end-to-end text-dependent speaker verification system for Chinese numerical strings that decouples speaker and text information, reducing EER by 49.2% on Hi-Mia and 75.0% on SHAL.
Contribution
The paper proposes a decoupling approach for TD-SV that combines a text embedding extractor, a speaker embedding extractor, and a fusion module with data augmentation, improving performance on Chinese numerical speech datasets.
Findings
Achieved 49.2% EER reduction on Hi-Mia
Achieved 75.0% EER reduction on SHAL
Introduced a publicly available Chinese numerical corpus
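The figures above are stated as EER reductions. As a rough illustration of how such a number is usually derived, the sketch below approximates the equal error rate from trial scores and then computes a relative reduction between two systems. It assumes the reported reductions are relative to a baseline, which this summary does not state explicitly. The function names and all scores are hypothetical and are not taken from the paper.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) when trials scoring >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine trials rejected
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the operating point where FAR and FRR are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent (assumed meaning of 'EER reduction')."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical similarity scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With only a handful of trials the threshold sweep is coarse; evaluation toolkits typically interpolate between operating points instead.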
Abstract
The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short-speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor, and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss comprising text classification loss, connectionist temporal classification (CTC) loss, and decoder loss; while in the speaker…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
