Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings
Florian L. Kreyssig, Philip C. Woodland

TL;DR
This paper introduces CD-VAT, a semi-supervised learning method that improves speaker-discriminative acoustic embeddings by leveraging unlabelled data through cosine-distance-based virtual adversarial training, which in turn improves speaker verification accuracy.
Contribution
The paper presents a novel semi-supervised training technique, CD-VAT, which, unlike many existing semi-supervised methods, does not require the unlabelled data to come from the same set of speakers as the labelled data.
Findings
Achieves an 11.1% relative reduction in equal error rate (EER) on the VoxCeleb dataset.
Demonstrates the effectiveness of cosine-distance-based adversarial training.
Provides a significant improvement over a purely supervised baseline.
Abstract
In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by leveraging unlabelled data. The technique is a variant of virtual adversarial training (VAT) [1] in the form of a loss that is defined as the robustness of the speaker embedding against input perturbations, as measured by the cosine-distance. Thus, we term the technique cosine-distance virtual adversarial training (CD-VAT). In comparison to many existing SSL techniques, the unlabelled data does not have to come from the same set of classes (here speakers) as the labelled data. The effectiveness of…
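The loss described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the one-layer `embed` function, the finite-difference gradient, and the values of `eps`, `xi`, and `n_power` are all assumptions chosen to keep the example small. It follows the general VAT recipe of estimating a worst-case input perturbation with a power-iteration step, then measuring how far the embedding moves in cosine distance. No speaker labels are used anywhere.

```python
import numpy as np

def embed(x, W):
    """Hypothetical toy embedding: one tanh layer standing in for the DNN."""
    return np.tanh(W @ x)

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def numerical_grad(f, r, h=1e-5):
    """Central finite-difference gradient of scalar f at r (assumption: used
    here only to avoid an autodiff dependency)."""
    g = np.zeros_like(r)
    for i in range(r.size):
        step = np.zeros_like(r)
        step[i] = h
        g[i] = (f(r + step) - f(r - step)) / (2 * h)
    return g

def cd_vat_loss(x, W, eps=0.5, xi=0.1, n_power=1, rng=None):
    """Cosine distance between the clean embedding and the embedding of an
    approximately worst-case perturbed input with norm eps."""
    rng = np.random.default_rng(0) if rng is None else rng
    e_clean = embed(x, W)  # treated as a fixed target

    def dist(r):
        return cosine_distance(e_clean, embed(x + r, W))

    # Start from a random unit direction.
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)
    # Power iteration: follow the gradient of the distance at a small offset.
    for _ in range(n_power):
        g = numerical_grad(dist, xi * d)
        d = g / (np.linalg.norm(g) + 1e-12)

    r_adv = eps * d  # adversarial perturbation of norm eps
    return dist(r_adv), r_adv

# Usage on random data (hypothetical dimensions).
rng = np.random.default_rng(0)
W = 0.2 * rng.standard_normal((8, 20))
x = rng.standard_normal(20)
loss, r_adv = cd_vat_loss(x, W, rng=rng)
print(f"CD-VAT loss: {loss:.4f}")
```

In VAT-style training, a term of this kind is typically added to the supervised loss and minimised with respect to the network parameters, so that embeddings of unlabelled inputs become robust to small perturbations.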
