NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification

Hyun-Jun Heo; Ui-Hyeop Shin; Ran Lee; YoungJu Cheon; Hyung-Min Park

arXiv:2312.08603 · eess.AS · April 1, 2026 · 1 citation

TL;DR

This paper introduces NeXt-TDNN, a modernized multi-scale temporal convolution backbone for speaker verification. Inspired by recent ConvNet structures, it improves on both the performance and the efficiency of previous models.

Contribution

The paper proposes TS-ConvNeXt, a novel 1D two-step multi-scale ConvNeXt block for TDNN that incorporates global response normalization (GRN), improving speaker verification accuracy while reducing computational cost.

Findings

01 NeXt-TDNN outperforms ECAPA-TDNN in speaker verification accuracy.

02 The model reduces parameter size and inference time.

03 Experimental results validate the effectiveness of the new backbone design.

Abstract

In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by drawing on Transformer designs, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is constructed from two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows for flexible capturing of inter-frame and intra-frame contexts. Additionally, we introduce global response normalization (GRN)…
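The two-step structure described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The function names, the channel-group split, the kernel values, the residual connections, and the placement of GRN inside the FFN (following ConvNeXt V2) are all assumptions made for illustration. Features are stored as a list of channels, each a list of per-frame values.

```python
import math

def depthwise_conv1d(channel, kernel):
    """Same-padded 1D convolution of one channel with an odd-length kernel."""
    half = len(kernel) // 2
    frames = len(channel)
    out = []
    for t in range(frames):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < frames:          # zero padding at the edges
                acc += w * channel[idx]
        out.append(acc)
    return out

def multi_scale_conv(x, kernels_per_group):
    """Step 1, inter-frame context: each channel group gets its own kernel size.
    The grouping scheme is an assumption, not taken from the paper."""
    groups = len(kernels_per_group)
    size = max(1, len(x) // groups)
    return [depthwise_conv1d(ch, kernels_per_group[min(c // size, groups - 1)])
            for c, ch in enumerate(x)]

def grn(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Global response normalization in the ConvNeXt V2 form, over the time axis."""
    g = [math.sqrt(sum(v * v for v in ch)) for ch in x]   # per-channel L2 norm
    mean_g = sum(g) / len(g)
    scale = [gi / (mean_g + eps) for gi in g]              # relative channel response
    return [[gamma * v * s + beta + v for v in ch] for ch, s in zip(x, scale)]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def pointwise(x, weight):
    """Frame-wise linear map; weight has shape [C_out][C_in]."""
    frames = len(x[0])
    return [[sum(w * x[ci][t] for ci, w in enumerate(row)) for t in range(frames)]
            for row in weight]

def frame_wise_ffn(x, w_in, w_out):
    """Step 2, intra-frame context: expand, GELU, GRN, project back."""
    hidden = [[gelu(v) for v in ch] for ch in pointwise(x, w_in)]
    return pointwise(grn(hidden), w_out)

def add(a, b):
    return [[p + q for p, q in zip(ca, cb)] for ca, cb in zip(a, b)]

def ts_block(x, kernels_per_group, w_in, w_out):
    """Hypothetical two-step block with an assumed residual around each step."""
    x = add(x, multi_scale_conv(x, kernels_per_group))
    return add(x, frame_wise_ffn(x, w_in, w_out))

# Usage: 4 channels, 5 frames, two kernel sizes, hidden width 8.
x = [[float(c + t) for t in range(5)] for c in range(4)]
kernels = [[1.0], [0.25, 0.5, 0.25]]
w_in = [[0.1] * 4 for _ in range(8)]
w_out = [[0.05] * 8 for _ in range(4)]
y = ts_block(x, kernels, w_in, w_out)   # keeps the (channels, frames) shape
```

Keeping the temporal convolution and the frame-wise network as separate functions mirrors the abstract's point that inter-frame and intra-frame contexts are handled in two distinct steps, so each step can be sized independently.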
