An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Junyi Peng; Oldrich Plchot; Themos Stafylakis; Ladislav Mosner; Lukas Burget; Jan Cernocky

arXiv:2210.01273·eess.AS·October 5, 2022·SLT

TL;DR

This paper introduces an attention-based backend for efficient fine-tuning of transformer models in speaker verification, achieving state-of-the-art results with reduced training time by employing novel feature extraction, regularization, and layer-specific learning rates.

Contribution

The paper proposes multi-head factorized attentive pooling as the feature extraction backend, and pairs it with regularization toward the pre-trained parameters and layer-specific learning rates to improve fine-tuning of pre-trained transformers for speaker verification.
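The two fine-tuning aids named above can be illustrated with a minimal sketch. It assumes a geometric decay of the learning rate from the top layer downward and a plain squared-distance penalty toward the pre-trained weights; the function names, the decay factor, and the penalty strength are hypothetical choices for illustration, not values taken from the paper.

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Return one learning rate per layer (index 0 = lowest layer).

    Assumption: the top layer uses top_lr and every layer below it
    is scaled by one more factor of `decay`.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def l2_to_pretrained(params, pretrained, strength):
    """Penalty strength * sum((theta - theta0)^2).

    It grows as the fine-tuned weights drift away from their
    pre-trained values, pulling them back toward the starting point.
    """
    return strength * sum((p - p0) ** 2 for p, p0 in zip(params, pretrained))


# Hypothetical toy values, chosen only to exercise the two helpers.
lrs = layerwise_learning_rates(num_layers=4, top_lr=1e-4, decay=0.5)
penalty = l2_to_pretrained([0.5, -1.0, 2.0], [0.0, -1.0, 1.0], strength=0.1)
print(lrs)
print(penalty)
```

In a real training loop the penalty would be added to the speaker classification loss, and each learning rate would be attached to the parameters of the matching transformer layer.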

Findings

01 Achieved state-of-the-art equal error rates (EERs) of 0.59%, 0.79%, and 1.77% on the Vox1-O, Vox1-E, and Vox1-H trial lists, respectively.

02 Reduced training time to 4 hours.

03 Demonstrated the effectiveness of the feature extraction and regularization strategies.
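The EER reported in the findings is the error rate at the decision threshold where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from trial scores follows; the function name and the toy scores are hypothetical, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one as done here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores for a handful of trials.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {100 * eer:.2f}%")
```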

Abstract

In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate schedule to stabilize the fine-tuning process and further boost performance: multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model, and we set different learning rates for each layer of the pre-trained model during fine-tuning. The experimental results show our method can significantly shorten the training time to 4 hours and…
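The pooling idea in the abstract can be sketched roughly as follows, assuming each attention head owns a query vector that softly selects a group of frames (one reading of the "phonetic clusters") and averages the corresponding value frames. The abstract does not say how keys and values are derived from the pre-trained model, so the sketch takes them as given inputs. All names and toy numbers are hypothetical, and the queries, which would be learned during training, are fixed here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attentive_pooling(keys, values, queries):
    """Pool T frames into one utterance-level vector.

    keys, values: T frames x D dims; queries: H heads x D dims.
    Returns the concatenated per-head vectors (length H * D) and
    the per-head attention weights over frames.
    """
    dim = len(values[0])
    pooled, all_weights = [], []
    for q in queries:
        # Score every frame's key against this head's query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Attention-weighted average of the value frames for this head.
        head = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        pooled.extend(head)
        all_weights.append(weights)
    return pooled, all_weights


# Hypothetical toy input: 3 frames, 2 dims, 2 heads.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
embedding, weights = multi_head_attentive_pooling(keys, values, queries)
print(len(embedding))
```

Because each head attends to a different subset of frames, comparing two utterances head by head compares them on similar kinds of frames, which is the intuition behind factorizing the comparison.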

Code & Models

Repositories

BUTSpeechFIT/wespeaker_ssl_public
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing