TL;DR
The paper introduces the Crossed-Time Delay Neural Network (CTDNN), an architecture that places multiple time-delay units with different context sizes at the bottom layer. It outperforms the original TDNN on both speaker verification and speaker identification.
Contribution
The paper proposes a new CTDNN structure inspired by the multi-filter setting of CNN convolution layers, significantly improving speaker recognition accuracy and training efficiency over the original TDNN and FTDNN models.
Findings
Outperforms the original TDNN in speaker verification on VoxCeleb1 with a 2.6% absolute Equal Error Rate (EER) improvement.
Achieves 90.4% identification accuracy in few-shot scenarios.
Demonstrates 36% accuracy improvement over FTDNN.
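For readers unfamiliar with the verification metric quoted above, here is a minimal sketch of how an Equal Error Rate can be estimated from trial scores. The function name, the threshold sweep, and the toy scores are illustrative assumptions, not the paper's evaluation code or data. An "absolute" improvement means the EER itself drops by that many percentage points.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest (a simple approximation).
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # False rejects: same-speaker trials scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False accepts: different-speaker trials scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented purely for illustration.
same_speaker = [0.9, 0.8, 0.4]
different_speaker = [0.5, 0.3, 0.1]
print(f"EER = {equal_error_rate(same_speaker, different_speaker):.3f}")
```

Production toolkits usually interpolate between thresholds rather than picking the closest observed one, so their values can differ slightly on small trial lists.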
Abstract
The Time Delay Neural Network (TDNN) is a well-performing structure for DNN-based speaker recognition systems. In this paper we introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to enhance the performance of the current TDNN. Inspired by the multi-filter setting of the convolution layer in convolutional neural networks, we set multiple time-delay units, each with a different context size, at the bottom layer and construct a multilayer parallel network. The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and identification tasks. In the verification experiment on the VoxCeleb1 dataset, it outperforms the original TDNN with a 2.6% absolute Equal Error Rate improvement. Under few-shot conditions, CTDNN reaches 90.4% identification accuracy, which doubles the identification accuracy of the original TDNN. We also compare the proposed CTDNN with another new variant of TDNN,…
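The bottom-layer idea described in the abstract can be sketched as several time-delay units (1-D convolutions over time) with different context sizes applied to the same frames in parallel. This is a minimal pure-Python sketch under stated assumptions: the context sizes (1, 3, 5), the random weights, and the mean-pool-then-concatenate merge are hypothetical choices for illustration, since the abstract does not specify how the branches are combined.

```python
import random


def time_delay_unit(frames, weights, bias):
    """One time-delay unit: a 1-D convolution over the time axis with ReLU.

    frames  : list of T feature vectors (each a list of D floats)
    weights : list of `context` weight vectors (each of length D)
    bias    : float
    Returns T - context + 1 activations, one per window position.
    """
    context = len(weights)
    out = []
    for t in range(len(frames) - context + 1):
        acc = bias
        for k in range(context):
            acc += sum(w * x for w, x in zip(weights[k], frames[t + k]))
        out.append(max(0.0, acc))
    return out


def crossed_bottom_layer(frames, units):
    """Run units with different context sizes in parallel on the same input.

    Branch outputs have different lengths, so each one is mean-pooled over
    time before concatenation (an assumption made for this sketch).
    """
    pooled = []
    for weights, bias in units:
        acts = time_delay_unit(frames, weights, bias)
        pooled.append(sum(acts) / len(acts))
    return pooled


def make_unit(context, dim, rng):
    """Build random weights for one unit (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
               for _ in range(context)]
    return weights, 0.0


rng = random.Random(0)
T, D = 20, 4  # 20 frames of 4-dimensional features (toy sizes)
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
units = [make_unit(c, D, rng) for c in (1, 3, 5)]  # hypothetical contexts
embedding = crossed_bottom_layer(frames, units)
print(len(embedding))  # one pooled value per parallel branch
```

In a real system each branch would have many output channels and feed further layers. The point here is only that one input is read through windows of several widths at once, instead of the single context size of a plain TDNN layer.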
