Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification
Nian Li, Jianguo Wei

TL;DR
This paper introduces a transformer-based speaker verification model that alternates neighborhood and global attention and applies progressive channel fusion, reaching competitive results on VoxCeleb benchmarks when trained on VoxCeleb2 alone.
Contribution
The paper proposes PCF-NAT, a new backbone network that integrates neighborhood and global attention with progressive channel fusion for improved speaker verification.
Findings
The shallow PCF-NAT has, on average, more than 20% lower EER and minDCF than a similarly sized ECAPA-TDNN.
Achieves less than 0.5% EER on VoxCeleb1-O with deep PCF-NAT.
Competitive performance using VoxCeleb2 alone for training.
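The findings above are reported as equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. As a minimal sketch of how that number can be computed from trial scores, assuming higher scores mean "same speaker" and using a simple threshold sweep, the hypothetical helper `equal_error_rate` below is illustrative only and is not the paper's or any toolkit's scoring code:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1/True for target (same-speaker) trials, 0/False for non-target.
    A trial is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)  # sorted, deduplicated

    # False acceptance rate: non-target trials that are accepted.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy trial lists (made-up scores, not results from the paper).
separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

With real trial lists the sweep is usually replaced by an interpolated ROC computation, but the definition of the metric is the same.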
Abstract
Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1&2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized…
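To make the alternation between neighborhood and global attention concrete, here is a minimal NumPy sketch over a sequence of frame features. It makes several simplifying assumptions that are not taken from the paper: a single head, no learned projections (queries, keys, and values are all the input itself), a 1-D neighborhood along the time axis, and a plain residual connection. The names `neighborhood_mask`, `masked_self_attention`, and `alternate_attention` are hypothetical.

```python
import numpy as np

def neighborhood_mask(num_frames, kernel_size):
    """Boolean (T, T) mask: frame i may attend to its kernel_size nearest frames.

    Near the edges the window is shifted inward so every frame still sees
    exactly kernel_size keys.
    """
    assert 1 <= kernel_size <= num_frames
    half = kernel_size // 2
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        start = min(max(i - half, 0), num_frames - kernel_size)
        mask[i, start:start + kernel_size] = True
    return mask

def masked_self_attention(x, mask=None):
    """Single-head self-attention with Q = K = V = x (assumption: no projections)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    if mask is not None:
        # Disallowed positions get zero weight after the softmax.
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

def alternate_attention(x, kernel_size, num_blocks):
    """Even blocks use neighborhood (local) attention, odd blocks global attention."""
    mask = neighborhood_mask(x.shape[0], kernel_size)
    for block in range(num_blocks):
        use_local = block % 2 == 0
        x = x + masked_self_attention(x, mask if use_local else None)
    return x

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))  # 50 frames, 16 channels (toy sizes)
out = alternate_attention(frames, kernel_size=7, num_blocks=4)
```

The local blocks restrict each frame to a fixed-size window, while the global blocks let every frame attend to the whole utterance. The hierarchical feature aggregation, attentive statistics pooling, and progressive channel fusion described in the abstract are not modeled here.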
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
