Weakly Supervised Training of Speaker Identification Models
Martin Karu, Tanel Alumäe

TL;DR
This paper introduces a weakly supervised training approach for speaker identification that leverages recording-level labels, diarization, and i-vectors to achieve high accuracy without needing speaker annotations at the segment level.
Contribution
The authors develop a training method that combines recording-level labels with speaker diarization, so that speaker identification models can be trained without segment-level speaker annotations.
Findings
Achieved 94.6% accuracy on a closed-set speaker identification task on the VoxCeleb dataset.
Attained 66% recall at 93% precision on an Estonian broadcast news dataset.
Surpassed the baseline on VoxCeleb by a large margin.
Abstract
We propose an approach for training speaker identification models in a weakly supervised manner. We concentrate on the setting where the training data consists of a set of audio recordings and the speaker annotation is provided only at the recording level. The method uses speaker diarization to find unique speakers in each recording, and i-vectors to project the speech of each speaker to a fixed-dimensional vector. A neural network is then trained to map i-vectors to speakers, using a special objective function that allows the model to be optimized using recording-level speaker labels. We report experiments on two different real-world datasets. On the VoxCeleb dataset, the method provides 94.6% accuracy on a closed-set speaker identification task, surpassing the baseline performance by a large margin. On an Estonian broadcast news dataset, the method provides 66% time-weighted speaker…
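The abstract does not spell out the objective function, so the following is only an illustrative sketch of how a loss can be computed from recording-level labels alone. It assumes a generic multiple-instance ("noisy-OR") formulation, which is not necessarily the objective used in the paper: each diarized speaker in a recording gets a posterior over known speakers, and the loss rewards the model when every speaker named in the recording-level label is claimed by at least one diarized speaker. The function names `softmax` and `recording_level_loss`, and the toy logits, are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw network outputs for one diarized speaker into a posterior."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recording_level_loss(posteriors, label_ids, eps=1e-12):
    """Noisy-OR loss over one recording (an assumed stand-in objective).

    posteriors: one posterior (list of floats) per diarized speaker.
    label_ids:  indices of the speakers named in the recording-level label.
    """
    loss = 0.0
    for y in label_ids:
        # Probability that no diarized speaker in the recording is speaker y.
        p_absent = 1.0
        for p in posteriors:
            p_absent *= 1.0 - p[y]
        # Penalize the model when a labeled speaker is unlikely to be present.
        loss -= math.log(1.0 - p_absent + eps)
    return loss


# Toy recording with two diarized speakers and three known identities.
labels = [0, 1]  # the recording-level label names speakers 0 and 1
matched = [softmax([4.0, 0.0, 0.0]), softmax([0.0, 4.0, 0.0])]
mismatched = [softmax([0.0, 0.0, 4.0]), softmax([0.0, 0.0, 4.0])]

print(recording_level_loss(matched, labels))
print(recording_level_loss(mismatched, labels))
```

In this sketch the loss is lower when the diarized speakers' posteriors cover the labeled identities than when both point to an unlabeled identity, which is the behavior a recording-level objective needs. In a real system the posteriors would come from a network applied to each diarized speaker's i-vector.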
