MGFF-TDNN: A Multi-Granularity Feature Fusion TDNN Model with Depth-Wise Separable Module for Speaker Verification
Ya Li, Bin Zhou, Bo Hu

TL;DR
This paper presents MGFF-TDNN, a novel speaker verification model that combines multi-granularity feature fusion with depth-wise separable convolutions to improve discriminative power and efficiency.
Contribution
The paper introduces MGFF-TDNN, integrating multi-granularity feature fusion with depth-wise separable modules for enhanced and efficient speaker verification.
Findings
Achieves superior performance on the VoxCeleb dataset.
Maintains efficiency with fewer parameters and lower computational cost.
Effectively captures both global and fine-grained speaker features.
Abstract
In speaker verification, traditional models often emphasize modeling long-term contextual features to capture global speaker characteristics. However, this approach can neglect fine-grained voiceprint information, which contains highly discriminative features essential for robust speaker embeddings. This paper introduces a novel model architecture, termed MGFF-TDNN, based on multi-granularity feature fusion. The MGFF-TDNN leverages a two-dimensional depth-wise separable convolution module, enhanced with local feature modeling, as a front-end feature extractor to effectively capture time-frequency domain features. To achieve comprehensive multi-granularity feature fusion, we propose the M-TDNN structure, which integrates global contextual modeling with fine-grained feature extraction by combining time-delay neural networks and phoneme-level feature pooling. Experiments on the VoxCeleb…
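The efficiency claim rests on a general property of depth-wise separable convolutions, which can be illustrated with a quick parameter count. The sketch below compares a standard 2-D convolution with a depth-wise convolution followed by a 1 x 1 point-wise convolution. The layer sizes (`c_in`, `c_out`, `k`) are hypothetical values chosen for illustration and are not taken from the paper; biases are omitted.

```python
def standard_conv2d_params(c_in, c_out, k):
    """Weight count of a standard k x k 2-D convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_conv2d_params(c_in, c_out, k):
    """Weight count of a depth-wise k x k convolution followed by a
    1 x 1 point-wise convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only (not from the paper).
c_in, c_out, k = 32, 64, 3

std = standard_conv2d_params(c_in, c_out, k)
dws = depthwise_separable_conv2d_params(c_in, c_out, k)

print(std)                   # 18432
print(dws)                   # 2336
print(round(dws / std, 3))   # 0.127
```

Under these assumed sizes the separable layer needs roughly one eighth of the weights. In general the ratio is 1/c_out + 1/k², so the saving grows with the number of output channels and the kernel size.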
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
