Disentangling Voice and Content with Self-Supervision for Speaker Recognition

Tianchi Liu; Kong Aik Lee; Qiongqiong Wang; Haizhou Li

arXiv:2310.01128 · eess.AS · November 2, 2023 · 5 cites

TL;DR

This paper introduces a self-supervised disentanglement framework for speaker recognition that separates speaker traits from speech content, lowering EER and minDCF on VoxCeleb and SITW without requiring additional model training or data.

Contribution

The paper proposes a disentanglement framework built on three Gaussian inference layers, each with a learnable transition model, together with a self-supervision method that disentangles content using no labels other than speaker identities.

Findings

01

Achieved a 9.56% average reduction in EER across the VoxCeleb and SITW datasets

02

Achieved an 8.24% average reduction in minDCF across the VoxCeleb and SITW datasets

03

The framework is practical to apply because it needs neither additional model training nor additional data

Abstract

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor data…
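
The gains above are stated as reductions in EER and minDCF. As a rough illustration of how such a figure could be computed, the sketch below approximates EER by sweeping a threshold over trial scores, then expresses the change between two systems as a relative reduction. The function names, the toy scores, and the 2.0 and 1.5 error values are hypothetical and are not taken from the paper. The sketch also assumes the reductions are relative to a baseline, which the abstract does not state explicitly.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Reduction of an error metric as a percentage of the baseline value."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical toy trial scores (higher means "more likely the same speaker").
toy_target = [0.9, 0.8, 0.7, 0.4]
toy_nontarget = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_target, toy_nontarget)

# Hypothetical baseline and proposed EER values, in percent.
toy_gain = relative_reduction(2.0, 1.5)
```

On these toy inputs, one target score falls below the best threshold and one non-target score reaches it, so both error rates meet at 0.25. Production toolkits usually interpolate between thresholds on the ROC curve instead of picking the closest observed point, so their EER values can differ slightly from this approximation.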

Videos

Disentangling Voice and Content with Self-Supervision for Speaker Recognition · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing