MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification

Fangyuan Wang; Zhigang Song; Hongchen Jiang; Bo Xu

arXiv:2107.03104 · cs.SD · July 8, 2021 · 1 cites

TL;DR

This paper introduces MACCIF-TDNN, a TDNN-based speaker verification architecture that aggregates channel and context interdependence features from multiple aspects using SE-Res2Blocks, a Transformer encoder, and multi-head pooling, and that outperforms most TDNN-based systems on VoxCeleb1.

Contribution

It proposes a multi-aspect aggregation architecture for speaker verification that combines channel interdependence features from SE-Res2Blocks with global context features from a Transformer encoder before a multi-head pooling stage.

Findings

01. Outperforms most TDNN-based systems on VoxCeleb1.

02. Effectively models long-term temporal features.

03. Enhances feature discrimination with multi-head pooling.

Abstract

Most recent state-of-the-art results in speaker verification are achieved by the X-vector and its subsequent variants. In this paper, we propose a new network architecture, based on the Time Delay Neural Network (TDNN), that aggregates channel and context interdependence features from multiple aspects. First, we use SE-Res2Blocks, as in ECAPA-TDNN, to explicitly model channel interdependence and thereby adaptively recalibrate channel features, and to process local context features in a multi-scale way at a more granular level than conventional TDNN-based methods. Second, we explore using the encoder structure of the Transformer to model global context interdependence features at the utterance level, which can better capture long-term temporal characteristics. Before the pooling layer, we aggregate the outputs of SE-Res2Blocks and Transformer encoder to leverage the…
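The adaptive calibration of channel features that the abstract attributes to the SE-Res2Blocks follows the general squeeze-and-excitation pattern: average each channel over time, pass the result through a small bottleneck, and use the resulting gates to rescale every frame. The plain-Python sketch below illustrates only that step. The function name `se_recalibrate`, the weight matrices, and the toy input are hypothetical, biases are omitted, and the Res2 multi-scale convolutions are not shown, so this should be read as an illustration under those assumptions rather than the authors' implementation.

```python
import math


def se_recalibrate(frames, w1, w2):
    """Squeeze-and-excitation style channel gating (illustrative sketch).

    frames: list of T frames, each a list of C channel values.
    w1:     R x C bottleneck weights (hypothetical values, no bias).
    w2:     C x R expansion weights (hypothetical values, no bias).
    Returns the rescaled frames and the per-channel gates.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])

    # Squeeze: summarize each channel by its mean over time.
    squeezed = [
        sum(frame[c] for frame in frames) / num_frames
        for c in range(num_channels)
    ]

    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w1
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]

    # Scale: multiply every frame by the per-channel gates.
    scaled = [[v * g for v, g in zip(frame, gates)] for frame in frames]
    return scaled, gates


# Toy input: 2 frames, 2 channels, bottleneck of size 1.
frames = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
scaled, gates = se_recalibrate(frames, w1, w2)
```

With these toy weights the first channel receives a gate close to 1 and the second a gate close to 0, so the second channel is suppressed in every frame. In a trained network the gates are learned, and each channel is weighted according to the utterance-level statistics of all channels.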

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax · Dense Connections · Adam