Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

Nian Li; Jianguo Wei

arXiv:2405.12031 · cs.SD · May 31, 2024

TL;DR

This paper introduces PCF-NAT, a transformer-based speaker verification backbone that alternates neighborhood and global attention and applies progressive channel fusion. It achieves competitive results on VoxCeleb1 and the VoxSRC validation sets when trained on VoxCeleb2 alone.

Contribution

The paper proposes PCF-NAT, a new backbone network that integrates neighborhood and global attention with progressive channel fusion for improved speaker verification.

Findings

01. The shallow PCF-NAT achieves, on average, more than 20% lower EER and minDCF than a similarly sized ECAPA-TDNN.

02. The deep PCF-NAT achieves an EER below 0.5% on VoxCeleb1-O.

03. Performance is competitive when training on VoxCeleb2 alone.
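
The findings are stated in terms of EER and minDCF. As a rough illustration of what these two metrics measure, the sketch below computes both from lists of trial scores. It is not the paper's evaluation code: the function names, the brute-force threshold sweep, and the cost parameters (`p_target=0.01`, `c_miss=c_fa=1`) are assumptions chosen for the example.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """False rejection and false acceptance rates when accepting score >= threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return frr, far


def candidate_thresholds(target_scores, nontarget_scores):
    """Every observed score, plus one threshold that rejects all trials."""
    return sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]


def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where FRR and FAR are closest to each other."""
    best_gap, eer = float("inf"), 1.0
    for t in candidate_thresholds(target_scores, nontarget_scores):
        frr, far = error_rates(target_scores, nontarget_scores, t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds.

    p_target, c_miss and c_fa are assumed values for illustration only.
    """
    costs = []
    for t in candidate_thresholds(target_scores, nontarget_scores):
        frr, far = error_rates(target_scores, nontarget_scores, t)
        costs.append(c_miss * p_target * frr + c_fa * (1.0 - p_target) * far)
    # Normalize by the cost of the better trivial system (accept all or reject all).
    return min(costs) / min(c_miss * p_target, c_fa * (1.0 - p_target))


same_speaker = [0.9, 0.8, 0.7, 0.3]   # hypothetical target-trial scores
diff_speaker = [0.6, 0.4, 0.2, 0.1]   # hypothetical non-target-trial scores
print(equal_error_rate(same_speaker, diff_speaker))
print(min_dcf(same_speaker, diff_speaker))
```

A "20% lower" EER in this context is a relative reduction: a baseline EER of 1.0% dropping to 0.8%, for example.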

Abstract

Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1&2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized…
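
To make the difference between neighborhood and global attention concrete, here is a minimal pure-Python sketch over a 1D sequence of frames, followed by a simplified attentive statistics pooling step. It is an illustration under several assumptions, not the PCF-NAT implementation: the single-head, projection-free attention, the edge-clamped window, and all function names are hypothetical, and the pooling scores are passed in directly instead of coming from a learned layer. Progressive channel fusion is not covered.

```python
import math


def neighborhood(i, length, kernel_size):
    """Indices of the kernel_size positions nearest to i, shifted inward at the edges."""
    k = min(kernel_size, length)
    start = min(max(i - k // 2, 0), length - k)
    return list(range(start, start + k))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, kernel_size=None):
    """Scaled dot-product attention.

    kernel_size=None lets every query see all positions (global attention);
    otherwise each query only sees its local neighborhood.
    """
    n, d = len(queries), len(queries[0])
    out = []
    for i, q in enumerate(queries):
        idx = list(range(n)) if kernel_size is None else neighborhood(i, n, kernel_size)
        logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in idx]
        weights = softmax(logits)
        out.append([sum(w * values[j][c] for w, j in zip(weights, idx))
                    for c in range(len(values[0]))])
    return out


def attentive_stats_pool(frames, scores):
    """Attention-weighted mean and standard deviation over frames, concatenated."""
    w = softmax(scores)
    dims = range(len(frames[0]))
    mean = [sum(a * f[c] for a, f in zip(w, frames)) for c in dims]
    var = [sum(a * f[c] ** 2 for a, f in zip(w, frames)) - mean[c] ** 2 for c in dims]
    return mean + [math.sqrt(max(v, 0.0)) for v in var]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # hypothetical features
local_out = attend(frames, frames, frames, kernel_size=3)   # neighborhood attention
global_out = attend(frames, frames, frames)                 # global attention
embedding = attentive_stats_pool(global_out, scores=[0.0, 0.0, 0.0, 0.0])
print(len(embedding))  # feature dimension doubled: mean plus standard deviation
```

The only difference between the two attention modes in this sketch is the index set each query is allowed to see, which is why alternating them mixes local and sequence-wide context at a lower cost than using global attention in every layer.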

Code & Models

Repositories

ChenNan1996/PCF-NAT (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis