TL;DR
This paper introduces a deep multi-metric learning approach for text-independent speaker verification that combines three loss functions (triplet, n-pair, and angular) to train the feature extractor, reaching an equal error rate of 3.48% on the large-scale VoxCeleb2 dataset, comparable to state-of-the-art systems.
Contribution
It proposes a novel multi-metric learning framework with three cooperative loss functions for speaker verification, enhancing feature extraction with residual and attention mechanisms.
Findings
Achieved an equal error rate of 3.48% on the VoxCeleb2 dataset.
Demonstrated performance comparable to state-of-the-art systems.
Provided the first large-scale open-source code for this task.
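The equal error rate reported above is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how it can be approximated from trial scores, not the paper's evaluation code. The function name `equal_error_rate`, the threshold sweep over observed scores, and the toy trials are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from similarity scores.

    scores: higher means "more likely the same speaker".
    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), None
    # A trial is accepted when score >= threshold; try every observed score.
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical toy trials, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

This sweep is quadratic in the number of trials, which is fine for illustration; evaluation on a full trial list would normally sort the scores once and accumulate the error counts.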
Abstract
Text-independent speaker verification is an important artificial intelligence problem that has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. The purpose of text-independent speaker verification is to determine whether two given uncontrolled utterances originate from the same speaker or not. Extracting speech features for each speaker using deep neural networks is a promising direction to explore, and a straightforward solution is to train the discriminative feature extraction network using a metric learning loss function. However, a single loss function often has certain limitations. Thus, we use deep multi-metric learning to address the problem and introduce three different losses, i.e., triplet loss, n-pair loss, and angular loss. The three loss functions work in a cooperative way to…
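To make the three losses named in the abstract concrete, here is a minimal sketch using their commonly published forms on plain Python lists. The abstract is truncated before it says how the losses cooperate, so the weighted sum in `multi_metric_loss`, the hardest-negative selection, the default margin and angle, and every function name are assumptions for illustration rather than the authors' method.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the positive closer than the negative by at least `margin`.
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def n_pair_loss(anchor, positive, negatives):
    # Compare one positive against several negatives at once.
    pos = dot(anchor, positive)
    return math.log(1.0 + sum(math.exp(dot(anchor, n) - pos) for n in negatives))


def angular_loss(anchor, positive, negative, alpha_deg=45.0):
    # Constrain the angle at the negative in the anchor/positive/negative triangle.
    center = [(a + p) / 2.0 for a, p in zip(anchor, positive)]
    tan_sq = math.tan(math.radians(alpha_deg)) ** 2
    return max(0.0, sq_dist(anchor, positive) - 4.0 * tan_sq * sq_dist(negative, center))


def multi_metric_loss(anchor, positive, negatives, weights=(1.0, 1.0, 1.0)):
    # ASSUMPTION: a weighted sum; the source does not state the combination rule.
    hardest = min(negatives, key=lambda n: sq_dist(anchor, n))
    w_triplet, w_npair, w_angular = weights
    return (
        w_triplet * triplet_loss(anchor, positive, hardest)
        + w_npair * n_pair_loss(anchor, positive, negatives)
        + w_angular * angular_loss(anchor, positive, hardest)
    )


# Hypothetical 2-D embeddings: the positive coincides with the anchor.
total = multi_metric_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

In this toy case the triplet and angular terms are already satisfied and contribute zero, so only the n-pair term remains. In practice the embeddings would come from the feature extraction network and the losses would be computed over mini-batches in an autodiff framework.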
