DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter for Speaker Verification
Yangfu Li, Jiapan Gan, Xiaodan Lin

TL;DR
This paper introduces DS-TDNN, a dual-stream neural network with a global-aware filter layer that captures long-range dependencies, improving speaker verification accuracy, especially on longer utterances, while reducing computational cost.
Contribution
The paper proposes a novel Global-aware Filter (GF) layer and a dual-stream TDNN architecture that together model global context and local features for speaker verification.
Findings
Achieves a 10% relative improvement over ECAPA-TDNN in speaker verification.
Reduces computational cost by 20% compared to ECAPA-TDNN.
Outperforms residual and attention-based models on variable-length utterances.
Abstract
Conventional time-delay neural networks (TDNNs) struggle to handle long-range context, so their ability to represent speaker information is limited in long utterances. Existing solutions either increase model complexity or try to balance local features against global context. To leverage the long-term dependencies of audio signals while constraining model complexity, we introduce a novel module called the Global-aware Filter layer (GF layer), which applies a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse to capture global context. Additionally, we develop a dynamic filtering strategy and a sparse regularization method to enhance the performance of the GF layer and prevent overfitting. Based on the GF layer, we present a dual-stream TDNN architecture called DS-TDNN…
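The core idea in the abstract, a learnable filter applied between a 1D discrete Fourier transform and its inverse, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `global_filter`, the tensor shapes, and the random stand-in for learned weights are all assumptions made for illustration, and the dynamic filtering strategy and sparse regularization mentioned in the abstract are left out. The sketch shows only why such a filter is "global": multiplying the spectrum by a weight is equivalent to a circular convolution over the whole sequence, so every output frame depends on every input frame.

```python
import numpy as np

def global_filter(x, weight):
    """Filter each channel of x in the frequency domain (illustrative sketch).

    x:      real array, shape (channels, frames)
    weight: complex array, shape (channels, frames // 2 + 1);
            stands in for the learnable transform-domain filter
    """
    n_frames = x.shape[-1]
    spectrum = np.fft.rfft(x, axis=-1)      # 1D DFT along the time axis
    filtered = spectrum * weight            # element-wise filtering per frequency bin
    return np.fft.irfft(filtered, n=n_frames, axis=-1)  # inverse DFT back to time

# Hypothetical sizes; random values stand in for trained filter weights.
rng = np.random.default_rng(0)
channels, frames = 4, 16
n_bins = frames // 2 + 1
x = rng.standard_normal((channels, frames))
weight = rng.standard_normal((channels, n_bins)) + 1j * rng.standard_normal((channels, n_bins))

y = global_filter(x, weight)

# Perturb a single input frame in channel 0 and see how far the change spreads.
x_perturbed = x.copy()
x_perturbed[0, 0] += 1.0
diff = global_filter(x_perturbed, weight) - y
frames_affected = int(np.count_nonzero(np.abs(diff[0]) > 1e-9))
```

In this sketch, a change to one input frame alters every output frame of the same channel, whereas a standard TDNN layer would only propagate it within its local receptive field. The cost is dominated by the FFT, which scales as O(T log T) in the number of frames T.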
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
