Fine-tuning wav2vec2 for speaker recognition

Nik Vaessen; David A. van Leeuwen

arXiv:2109.15053·cs.SD·May 9, 2022

TL;DR

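This paper adapts the wav2vec2 framework to speaker recognition, studying how to pool its output sequence into a fixed-length speaker embedding and proposing single-utterance and utterance-pair classification variants. The best variant reaches 1.88% EER on the extended voxceleb1 test set, close to the 1.69% EER of an ECAPA-TDNN baseline.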

Contribution

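It shows how to use wav2vec2 for speaker recognition: it studies the effectiveness of the pre-trained weights, studies how to pool the output sequence into a fixed-length speaker embedding, and proposes two classification variants, single-utterance (CE or AAM softmax loss) and utterance-pair (BCE loss).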

Findings

01

w2v2-aam achieves 1.88% EER on the extended voxceleb1 test set

02

Proposed methods outperform some baselines, but not the ECAPA-TDNN baseline (1.69% EER)

03

Code is publicly available
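
The EER reported in the findings above is the error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal. The sketch below shows one simple way to approximate it from a list of trial scores. The function name `equal_error_rate` and the threshold sweep are illustrative assumptions, not taken from the paper's repository, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from trial scores and labels.

    scores: higher means "more likely the same speaker".
    labels: 1 for a same-speaker trial, 0 for a different-speaker trial.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), 1.0
    # Try every observed score as the threshold (accept if score >= threshold).
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2
    return eer


# Toy example: 4 same-speaker and 4 different-speaker trials.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # 0.25
```

In the toy example, one different-speaker trial scores above one same-speaker trial, so at a threshold of 0.6 both error rates are 1 in 4.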

Abstract

This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
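
To make the pooling and loss terminology in the abstract concrete, the following is a minimal sketch in plain Python. It assumes mean pooling over time, which is only one of several possible ways to pool, and it uses the standard additive angular margin (AAM) softmax formulation with illustrative values for `scale` and `margin`. None of the names or defaults are taken from the paper or its repository, and a real system would use a tensor library and the actual wav2vec2 frame outputs.

```python
import math


def mean_pool(frames):
    """Average a sequence of frame vectors (T x D) into one D-dim embedding."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def _cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aam_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.2):
    """AAM softmax loss for one embedding.

    The angle between the embedding and the target speaker's weight vector
    is increased by `margin` before the scaled softmax cross-entropy.
    `scale` and `margin` are illustrative defaults (an assumption).
    Simplification: no special handling when angle + margin exceeds pi.
    """
    logits = []
    for j, weight in enumerate(class_weights):
        cos_theta = max(-1.0, min(1.0, _cosine(embedding, weight)))
        if j == target:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    # Numerically stable cross-entropy: logsumexp(logits) - logits[target].
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[target]


# Two hypothetical 2-dim frames pooled into one embedding, two speaker classes.
frames = [[1.0, 0.0], [3.0, 2.0]]
embedding = mean_pool(frames)  # [2.0, 1.0]
speaker_weights = [[1.0, 0.0], [0.0, 1.0]]
loss = aam_softmax_loss(embedding, speaker_weights, target=0)
```

Setting `margin=0.0` reduces the loss to a cosine-similarity softmax cross-entropy, so for the same input the margin version gives a larger loss. This is how the margin pushes embeddings of the same speaker closer together during training.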

Taxonomy

Methods: Test · Softmax