Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Mufan Sang; Yong Zhao; Gang Liu; John H.L. Hansen; Jian Wu

arXiv:2302.08639·eess.AS·March 2, 2023

TL;DR

This paper introduces two novel Transformer-based models with enhanced locality mechanisms for speaker verification, significantly improving accuracy over existing Transformer and CNN models on multiple datasets.

Contribution

It proposes the Locality-Enhanced Conformer and Speaker Swin Transformer, integrating locality modeling into Transformer architectures for improved speaker embedding extraction.

Findings

01. Achieved 0.75% EER on VoxCeleb1, outperforming prior models.

02. Reduced EER by 14.6% (relative) on the MS-internal dataset compared to Res2Net50.

03. Demonstrated the effectiveness of locality-enhanced Transformers for speaker verification.
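
The findings above are stated in terms of equal error rate (EER) and relative EER reduction. As a reading aid, here is a minimal sketch of how these two quantities are commonly computed. It is not the paper's evaluation code: the function names `equal_error_rate` and `relative_reduction` are hypothetical, and the threshold sweep is a simple approximation that picks the observed score where false acceptance and false rejection rates are closest.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, improved_eer):
    """Relative EER reduction of an improved system over a baseline."""
    return (baseline_eer - improved_eer) / baseline_eer
```

With this definition, a hypothetical baseline EER of 2.0% and an improved EER of 1.5% would give a relative reduction of 0.25, that is, 25%. These numbers are illustrative only and do not come from the paper.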

Abstract

Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with improved locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal)…
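
The abstract names two locality operations, depth-wise convolution and channel-wise attention. The sketch below illustrates both in dependency-free Python on a small `channels x frames` feature map. It is an illustration under assumptions, not the authors' implementation: the channel-wise attention is assumed here to be a squeeze-and-excitation style gate, the function names and weight shapes are hypothetical, and the abstract does not say where inside the Conformer block these operations are placed.

```python
import math


def depthwise_conv1d(x, kernels):
    """Depth-wise 1-D convolution with zero 'same' padding.

    x: list of channels, each a list of T frame values.
    kernels: one odd-length kernel per channel.
    Each channel is filtered only by its own kernel, so there is no
    cross-channel mixing. As in most deep learning toolkits, this is
    cross-correlation (the kernel is not flipped).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(k * padded[t + i] for i, k in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def channel_attention(x, w_down, w_up):
    """Squeeze-and-excitation style channel gate (assumed form).

    w_down: hidden x channels weight matrix (bottleneck projection).
    w_up: channels x hidden weight matrix (back to one gate per channel).
    """
    # Squeeze: average each channel over time.
    squeezed = [sum(ch) / len(ch) for ch in x]
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    # Rescale every frame of a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, x)]
```

The depth-wise convolution captures short-range context along time within each channel, while the gate reweights whole channels using a global summary. Real models would learn the kernels and weights and operate on batched tensors.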

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Stochastic Depth · Label Smoothing · Dense Connections · Absolute Position Encodings · Adam · Softmax