MGFF-TDNN: A Multi-Granularity Feature Fusion TDNN Model with Depth-Wise Separable Module for Speaker Verification

Ya Li; Bin Zhou; Bo Hu

arXiv:2505.03228·cs.SD·May 7, 2025

TL;DR

This paper presents MGFF-TDNN, a novel speaker verification model that combines multi-granularity feature fusion with depth-wise separable convolutions to improve discriminative power and efficiency.

Contribution

The paper introduces MGFF-TDNN, integrating multi-granularity feature fusion with depth-wise separable modules for enhanced and efficient speaker verification.

Findings

01. Achieves superior performance on the VoxCeleb dataset.

02. Remains efficient, with fewer parameters and lower computational cost.

03. Captures both global and fine-grained speaker features.

Abstract

In speaker verification, traditional models often emphasize modeling long-term contextual features to capture global speaker characteristics. However, this approach can neglect fine-grained voiceprint information, which contains highly discriminative features essential for robust speaker embeddings. This paper introduces a novel model architecture, termed MGFF-TDNN, based on multi-granularity feature fusion. The MGFF-TDNN leverages a two-dimensional depth-wise separable convolution module, enhanced with local feature modeling, as a front-end feature extractor to effectively capture time-frequency domain features. To achieve comprehensive multi-granularity feature fusion, we propose the M-TDNN structure, which integrates global contextual modeling with fine-grained feature extraction by combining time-delay neural networks and phoneme-level feature pooling. Experiments on the VoxCeleb…
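As a rough illustration of why a depth-wise separable front-end can reduce parameter count, the sketch below compares the weights in a standard 2D convolution with those in a depth-wise plus point-wise pair. The function names and the channel and kernel sizes are hypothetical choices for illustration only; they are not taken from the paper or its repository.

```python
def conv2d_params(c_in, c_out, k, bias=False):
    """Parameter count of a standard k x k 2D convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def dw_separable_params(c_in, c_out, k, bias=False):
    """Parameter count of a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not the configuration used in MGFF-TDNN.
c_in, c_out, k = 32, 64, 3

standard = conv2d_params(c_in, c_out, k)
separable = dw_separable_params(c_in, c_out, k)
ratio = separable / standard  # without bias this equals 1/c_out + 1/k**2

print(f"standard:  {standard} parameters")
print(f"separable: {separable} parameters")
print(f"ratio:     {ratio:.3f}")
```

For these assumed sizes the separable pair needs roughly one eighth of the weights of the standard layer; the actual savings in the model depend on its real layer configuration.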

Code & Models

Repositories

leia404/MGFF-TDNN (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing