Improving Transformer-based Networks With Locality For Automatic Speaker Verification
Mufan Sang, Yong Zhao, Gang Liu, John H.L. Hansen, Jian Wu

TL;DR
This paper introduces two novel Transformer-based models with enhanced locality mechanisms for speaker verification, significantly improving accuracy over existing Transformer and CNN models on multiple datasets.
Contribution
It proposes the Locality-Enhanced Conformer and Speaker Swin Transformer, integrating locality modeling into Transformer architectures for improved speaker embedding extraction.
Findings
Achieved 0.75% equal error rate (EER) on VoxCeleb 1, outperforming prior Transformer-based and CNN-based models.
Reduced EER by 14.6% relative to Res2Net50 on the MS-internal dataset.
Demonstrated effectiveness of locality-enhanced Transformers in speaker verification.
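To make the two metrics in the findings concrete, here is a minimal sketch of how an EER and a relative EER reduction can be computed from verification scores. The function names and the toy scores are illustrative assumptions, not taken from the paper, and the threshold sweep is a simple approximation rather than the authors' evaluation code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over every observed score."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # The EER is the operating point where the two error rates meet.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, improved_eer):
    """Relative (not absolute) reduction, e.g. 0.146 means 14.6% relative."""
    return (baseline_eer - improved_eer) / baseline_eer


# Toy scores, purely illustrative.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)

# Hypothetical baseline of 2.0% EER (not a figure from the paper):
# a 14.6% relative reduction would bring it to about 1.708%.
toy_reduction = relative_reduction(2.0, 1.708)
```

Note that a relative reduction depends on the baseline value, so a 14.6% relative improvement is a much smaller change in absolute EER percentage points.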
Abstract
Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with improved locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal)…
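The two locality operations named in the abstract, depth-wise convolution and channel-wise attention, can be illustrated with a small dependency-free sketch. This is a generic illustration only: the abstract does not specify the block layout, so the squeeze-and-excitation style gating, the function names, and the weight shapes below are all assumptions rather than the authors' design.

```python
import math


def depthwise_conv1d(x, kernels):
    """x: list of channels, each a list of T frames.
    kernels: one odd-length kernel per channel.
    Each channel is filtered only by its own kernel (no cross-channel
    mixing), with zero padding so the output keeps length T."""
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(k * padded[t + i] for i, k in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def channel_attention(x, w_down, w_up):
    """Assumed squeeze-and-excitation style gate: average each channel
    over time, apply a ReLU bottleneck and a sigmoid, then rescale."""
    squeezed = [sum(ch) / len(ch) for ch in x]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    return [[g * v for v in ch] for g, ch in zip(gates, x)]


# Two channels, four frames; the kernels look only at neighbouring frames.
features = [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]]
local = depthwise_conv1d(features, [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
# All-zero weights give a sigmoid gate of 0.5 for every channel.
gated = channel_attention(local, [[0.0, 0.0]], [[0.0], [0.0]])
```

The convolution only ever combines a frame with its immediate neighbours, which is the short-range context that global self-attention does not emphasize, while the gate reweights whole channels based on a time-averaged summary.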
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Stochastic Depth · Label Smoothing · Dense Connections · Absolute Position Encodings · Adam · Softmax
