Evaluating Speaker Identity Coding in Self-supervised Models and Humans

Gasser Elbanna

arXiv:2406.10401 · eess.AS · June 18, 2024

TL;DR

This study compares self-supervised speech models with human perception in speaker identification. It finds that the models outperform acoustic features, behave similarly to humans across speech variants, and in some cases predict brain responses to natural speech.

Contribution

The study shows that self-supervised models outperform acoustic features in speaker identification and suggests that their representations can help explain the neural basis of speaker perception.

Findings

01. Self-supervised models outperform acoustic features in speaker identification.

02. Model and human performance show similarities across speech variants.

03. Some models can predict brain responses during natural speech.

Abstract

Speaker identity plays a significant role in human communication and is being increasingly used in societal applications, many through advances in machine learning. Speaker identity perception is an essential cognitive phenomenon that can be broadly reduced to two main tasks: recognizing a voice or discriminating between voices. Several studies have attempted to identify acoustic correlates of identity perception to pinpoint salient parameters for such a task. Unlike other communicative social signals, most efforts have yielded inefficacious conclusions. Furthermore, current neurocognitive models of voice identity processing consider the bases of perception as acoustic dimensions such as fundamental frequency, harmonics-to-noise ratio, and formant dispersion. However, these findings do not account for naturalistic speech and within-speaker variability. Representational spaces of current…

Taxonomy

Topics: Speech Recognition and Synthesis