TL;DR
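This paper adapts the wav2vec2 framework from speech recognition to speaker recognition. It studies how to pool the wav2vec2 output sequence into a fixed-length speaker embedding and proposes single-utterance and utterance-pair classification variants. The best variant, w2v2-aam, reaches 1.88% EER on the extended voxceleb1 test set, close to but not better than the ECAPA-TDNN baseline at 1.69% EER.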
Contribution
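It shows how to use wav2vec2 for speaker recognition: it evaluates the pre-trained weights on this task, studies how to pool the output sequence into a fixed-length speaker embedding, and proposes two classification variants (single-utterance with CE or AAM softmax loss, and utterance-pair with BCE loss).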
Findings
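w2v2-aam, the best performing variant, achieves 1.88% EER on the extended voxceleb1 test set
The ECAPA-TDNN baseline remains ahead at 1.69% EER on the same test set
Code is publicly available at https://github.com/nikvaessen/w2v2-speaker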
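The EER figures above refer to the operating point where the false-acceptance rate and the false-rejection rate of a verification system are equal. The sketch below illustrates that metric with a plain threshold sweep; it is not the paper's evaluation code. The function name `equal_error_rate`, the label convention (1 for same-speaker trials, 0 for different-speaker trials), and the lack of interpolation between thresholds are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a list of verification trials.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for a same-speaker trial, 0 for a different-speaker trial.
    Assumption: a simple sweep over observed scores, no interpolation.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both same-speaker and different-speaker trials")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(
            1 for s, l in zip(scores, labels) if l == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, l in zip(scores, labels) if l == 1 and s < threshold
        )
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With a large trial list the two rates cross almost exactly, so the midpoint used here is a reasonable approximation; evaluation toolkits typically interpolate along the ROC curve instead.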
Abstract
This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
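The two ingredients named in the abstract, pooling a frame sequence into a fixed-length embedding and classifying it with an AAM softmax loss, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: mean pooling over time is only one possible pooling choice and is assumed here, the margin and scale values are placeholders, and the helper names (`mean_pool`, `aam_logits`, `cross_entropy`) are hypothetical. The repository linked above holds the actual method.

```python
import math

def mean_pool(frames):
    """Average a sequence of frame vectors (T x D) into one D-dim embedding.
    Assumption: mean pooling stands in for whichever pooling the paper uses."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def aam_logits(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Cosine logits with an additive angular margin on the target class.
    margin and scale are illustrative values, not taken from the paper.
    Simplification: no special handling when angle + margin exceeds pi."""
    e = l2_normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if k == target:
            # Penalise the true class by widening its angle to the embedding.
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    return logits

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

# Usage: two frames of a toy 2-dim "wav2vec2 output", two speaker classes.
embedding = mean_pool([[1.0, 0.0], [3.0, 2.0]])
weights = [[1.0, 0.0], [0.0, 1.0]]
loss = cross_entropy(aam_logits(embedding, weights, target=0), target=0)
```

The margin lowers the logit of the correct speaker during training, so the network has to separate speakers by a wider angle to reach the same loss as plain CE.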
Taxonomy
Methods: Test · Softmax
