Learning from human perception to improve automatic speaker verification in style-mismatched conditions

Amber Afshan; Abeer Alwan

arXiv:2206.13684 · eess.AS · June 29, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a novel training loss function inspired by human perception to enhance automatic speaker verification performance under style-mismatched conditions, demonstrating significant improvements across multiple datasets.

Contribution

The paper proposes the CllrCE loss, integrating human perceptual insights into training to better handle style variability in speaker verification systems.

Findings

01. CllrCE loss improves EER by up to 66% on the UCLA speaker variability database.

02. Significant reductions in minDCF are observed with the new loss function.

03. Performance gains are consistent with conditioning in SITW evaluations.

Abstract

Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination, especially in the presence of speaking style variability. The experiments examined read versus conversational speech. Listeners focused on speaker-specific idiosyncrasies while "telling speakers together", and on relative distances in a shared acoustic space when "telling speakers apart". However, automatic speaker verification (ASV) systems use the same loss function irrespective of target or non-target trials. To improve ASV performance in the presence of style variability, insights learnt from human perception are used to design a new training loss function that we refer to as "CllrCE loss". CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system. When using the UCLA speaker variability database, in…
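
The abstract shown here does not give the formula for CllrCE loss. Its name suggests a combination of the log-likelihood-ratio cost (Cllr) and cross-entropy (CE), so the following is only a minimal sketch under that assumption. The Cllr term uses the standard definition over target and non-target trial scores. The weighted sum, the `weight` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def cllr(target_llrs, nontarget_llrs):
    # Standard log-likelihood-ratio cost, in bits.
    # Target trials are penalised for low scores,
    # non-target trials for high scores.
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(softplus(-tar)) / np.log(2.0)
    c_non = np.mean(softplus(non)) / np.log(2.0)
    return 0.5 * (c_tar + c_non)

def cross_entropy(logits, labels):
    # Mean softmax cross-entropy over speaker classes (natural log).
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def cllr_ce(logits, labels, target_llrs, nontarget_llrs, weight=1.0):
    # ASSUMPTION: a simple weighted sum of the two terms.
    # The paper's actual combination may differ.
    return cross_entropy(logits, labels) + weight * cllr(target_llrs, nontarget_llrs)
```

In this sketch the two trial types contribute separate terms to Cllr, which loosely mirrors the abstract's point that target and non-target trials need not be treated identically. An actual training setup would need a differentiable implementation in a deep learning framework.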

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems