Weakly Supervised Training of Speaker Identification Models

Martin Karu; Tanel Alumäe

arXiv:1806.08621 · cs.SD · June 25, 2018

TL;DR

This paper introduces a weakly supervised training approach for speaker identification that leverages recording-level labels, diarization, and i-vectors to achieve high accuracy without needing speaker annotations at the segment level.

Contribution

The authors develop a training method that combines recording-level speaker labels with speaker diarization, so that speaker identification models can be trained without segment-level speaker annotations.

Findings

1. Achieved 94.6% accuracy on a closed-set speaker identification task on the VoxCeleb dataset.

2. Attained 66% recall at 93% precision on an Estonian broadcast news dataset.

3. Surpassed the baseline performance on VoxCeleb by a large margin.

Abstract

We propose an approach for training speaker identification models in a weakly supervised manner. We concentrate on the setting where the training data consists of a set of audio recordings and the speaker annotation is provided only at the recording level. The method uses speaker diarization to find unique speakers in each recording, and i-vectors to project the speech of each speaker to a fixed-dimensional vector. A neural network is then trained to map i-vectors to speakers, using a special objective function that allows the model to be optimized using recording-level speaker labels. We report experiments on two different real-world datasets. On the VoxCeleb dataset, the method provides 94.6% accuracy on a closed-set speaker identification task, surpassing the baseline performance by a large margin. On an Estonian broadcast news dataset, the method provides 66% time-weighted speaker…
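
The abstract does not spell out the objective function, so the following is only a rough sketch of how a recording-level loss could look, not the authors' formulation. It assumes a generic multiple-instance ("noisy-OR") stand-in: each diarized speaker cluster gets a row of network scores over the known speakers, and the loss rewards the model when at least one cluster in the recording is assigned to each speaker listed in the recording-level label. The function names `softmax` and `recording_level_loss`, the array shapes, and the noisy-OR choice are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def recording_level_loss(logits, present_speakers):
    """Hypothetical weakly supervised loss for one recording.

    logits: array of shape (K, N), scores for K diarized clusters
            (one i-vector each) over N known speakers.
    present_speakers: indices of speakers named in the recording-level label.

    Assumption: a noisy-OR over clusters, i.e. the probability that at least
    one cluster belongs to speaker s. This is a generic multiple-instance
    loss, not the objective defined in the paper.
    """
    probs = softmax(logits)
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for s in present_speakers:
        p_at_least_one = 1.0 - np.prod(1.0 - probs[:, s])
        loss -= np.log(p_at_least_one + eps)
    return loss / max(len(present_speakers), 1)


# Two diarized clusters, three known speakers; the label says speakers 2 and 0 appear.
matching = np.array([[0.0, 0.0, 10.0], [10.0, 0.0, 0.0]])
mismatched = np.array([[0.0, 10.0, 0.0], [0.0, 10.0, 0.0]])
loss_matching = recording_level_loss(matching, [2, 0])
loss_mismatched = recording_level_loss(mismatched, [2, 0])
```

In this toy setup the loss is close to zero when each labeled speaker is claimed by some cluster and large when no cluster is, which is the behavior a recording-level objective needs: it never requires knowing which cluster belongs to which labeled speaker.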
