Time-Contrastive Learning Based DNN Bottleneck Features for Text-Dependent Speaker Verification
Achintya Kr. Sarkar, Zheng-Hua Tan

TL;DR
This paper introduces a novel time-contrastive learning approach for DNN bottleneck feature extraction that leverages temporal structure in speech, improving text-dependent speaker verification performance.
Contribution
It proposes a TCL-based BN feature extraction method that learns generic features from unlabeled temporal segments, outperforming traditional speaker and pass-phrase discriminant features.
Findings
TCL-BN features outperform existing BN and MFCC features in speaker verification.
The method effectively captures temporal structure for robust feature learning.
Experimental results on the RedDots Challenge 2016 database support these findings.
Abstract
In this paper, we present a time-contrastive learning (TCL) based bottleneck (BN) feature extraction method for speech signals with an application to text-dependent (TD) speaker verification (SV). It is well known that speech signals exhibit quasi-stationary behavior only within short intervals, and the TCL method aims to exploit this temporal structure. More specifically, it trains deep neural networks (DNNs) to discriminate temporal events obtained by uniformly segmenting speech signals, in contrast to existing DNN based BN feature extraction methods that train DNNs using labeled data to discriminate speakers, pass-phrases, phones, or a combination of them. In the context of speaker verification, speech data of fixed pass-phrases are used for TCL-BN training, while the pass-phrases used for TCL-BN training are excluded from being used for SV, so that the learned features can be…
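The uniform segmentation described in the abstract can be illustrated with a minimal sketch. It assumes that every utterance is split into the same number of equal-length segments and that each frame's segment index serves as its DNN training target; the function name `tcl_frame_labels` and the parameter `num_segments` are hypothetical, and the abstract does not state how many segments the authors use.

```python
def tcl_frame_labels(num_frames, num_segments):
    """Assign each frame of one utterance to a uniform temporal segment.

    Illustrative assumption: the segment index (0 .. num_segments - 1)
    is used as the class label, so no speaker or phonetic annotation
    is needed to build the training targets.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    if num_frames < num_segments:
        raise ValueError("utterance has fewer frames than segments")
    # Integer arithmetic spreads the frames as evenly as possible
    # across the segments, in temporal order.
    return [(i * num_segments) // num_frames for i in range(num_frames)]


# Example: a 10-frame utterance split into 4 segments.
labels = tcl_frame_labels(10, 4)
print(labels)
```

Under this assumption the labels depend only on frame position within the utterance, which is why the training data need no manual annotation.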
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
