Delving into VoxCeleb: environment invariant speaker recognition

Joon Son Chung; Jaesung Huh; Seongkyu Mun

arXiv:1910.11238·cs.SD·February 4, 2020·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces an environment adversarial training framework that leverages video data in VoxCeleb to learn speaker embeddings invariant to environmental conditions, improving generalization in speaker recognition tasks.

Contribution

It proposes an adversarial training method that uses the video information in VoxCeleb to make speaker embeddings invariant to the recording environment, an approach that had not been explored before.
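To make the adversarial idea concrete, here is a minimal sketch of a generic two-player objective. An environment discriminator is trained with ordinary cross-entropy, while the embedding network is penalised whenever the discriminator's prediction departs from uniform. This summary does not state the paper's exact losses, so the function names (`cross_entropy`, `confusion_loss`) and the uniform-target formulation are assumptions used for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Discriminator loss: negative log-probability of the true environment."""
    return -math.log(softmax(logits)[target_index])


def confusion_loss(logits):
    """Embedding-network loss (assumed form): cross-entropy against a uniform
    target. It is smallest when the discriminator cannot tell environments
    apart, that is, when every class gets the same probability."""
    probs = softmax(logits)
    return -sum(math.log(p) for p in probs) / len(probs)


# Hypothetical discriminator outputs for a two-way environment decision.
undecided = [0.0, 0.0]   # embedding reveals nothing about the environment
confident = [3.0, -3.0]  # embedding leaks the environment

print(confusion_loss(undecided))  # equals log(2), the minimum for two classes
print(confusion_loss(confident))  # larger: the embedding network is penalised
```

In a real training loop the two losses would be minimised alternately or jointly, each with respect to its own network's parameters; that scheduling is left out of this sketch.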

Findings

01 Significant performance improvements over baselines in speaker identification.

02 Enhanced generalization to unseen environmental conditions.

03 Effective use of video data for environment-invariant feature learning.

Abstract

Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets. There has been a plethora of work in search of more powerful architectures or loss functions suitable for the task, but these works do not consider what information is learnt by the models, apart from being able to predict the given labels. In this work, we introduce an environment adversarial training framework in which the network can effectively learn speaker-discriminative and environment-invariant embeddings without explicit domain shift during training. We achieve this by utilising the previously unused 'video' information in the VoxCeleb dataset. The environment adversarial training allows the network to generalise better to unseen conditions. The method is evaluated on both speaker identification and…
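One plausible way to use the 'video' information as a weak environment label is sketched below. It relies on the VoxCeleb convention that each utterance path has the form `speaker_id/video_id/segment.wav`, and it assumes that segments cut from the same video share a recording environment. The helper names (`parse_path`, `sample_triplet`) and the example paths are hypothetical; the sampling scheme is an illustration, not a statement of the paper's procedure.

```python
import random
from collections import defaultdict


def parse_path(path):
    """Split 'speaker/video/segment.wav' into (speaker, video, segment)."""
    speaker, video, segment = path.split("/")
    return speaker, video, segment


def sample_triplet(paths, rng):
    """Return (anchor, same_video, other_video) for one randomly chosen speaker.

    All three segments share a speaker. Only the first two share a video,
    so the video ID stands in for an environment label.
    """
    by_speaker = defaultdict(lambda: defaultdict(list))
    for p in paths:
        speaker, video, _ = parse_path(p)
        by_speaker[speaker][video].append(p)

    # A speaker is usable if they have at least two videos and at least
    # one video with two or more segments.
    eligible = [
        spk for spk in sorted(by_speaker)
        if len(by_speaker[spk]) >= 2
        and any(len(segs) >= 2 for segs in by_speaker[spk].values())
    ]
    if not eligible:
        raise ValueError("no speaker has enough videos and segments")

    videos = by_speaker[rng.choice(eligible)]
    anchor_video = rng.choice(sorted(v for v, s in videos.items() if len(s) >= 2))
    anchor, same_video = rng.sample(videos[anchor_video], 2)
    other_id = rng.choice(sorted(v for v in videos if v != anchor_video))
    other_video = rng.choice(videos[other_id])
    return anchor, same_video, other_video


# Hypothetical paths following the speaker/video/segment layout.
example_paths = [
    "id1/vA/0.wav", "id1/vA/1.wav", "id1/vB/0.wav",
    "id2/vC/0.wav",  # only one video, so this speaker is skipped
]
print(sample_triplet(example_paths, random.Random(0)))
```

An environment discriminator could then be asked whether a pair of same-speaker segments comes from the same video, which requires no labels beyond those already encoded in the file paths.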

Code & Models

Repositories

theolepage/sslsv (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing