Evaluating Speaker Identity Coding in Self-supervised Models and Humans
Gasser Elbanna

TL;DR
This study compares self-supervised speech models with human perception in speaker identification. It finds that the models outperform acoustic features and can mirror human behavior and neural responses to natural speech.
Contribution
It demonstrates that self-supervised models outperform acoustic features in speaker identification and can help elucidate the neural basis of speaker perception.
Findings
Self-supervised models outperform acoustic features in speaker identification.
Model and human performance show similarities across speech variants.
Some models can predict brain responses during natural speech.
Abstract
Speaker identity plays a significant role in human communication and is increasingly used in societal applications, many of them enabled by advances in machine learning. Speaker identity perception is an essential cognitive phenomenon that can be broadly reduced to two main tasks: recognizing a voice or discriminating between voices. Several studies have attempted to identify acoustic correlates of identity perception to pinpoint salient parameters for such a task. Unlike for other communicative social signals, most of these efforts have been inconclusive. Furthermore, current neurocognitive models of voice identity processing take acoustic dimensions such as fundamental frequency, harmonics-to-noise ratio, and formant dispersion as the bases of perception. However, these findings do not account for naturalistic speech and within-speaker variability. Representational spaces of current…
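To make one of the acoustic dimensions named in the abstract concrete, here is a minimal sketch of formant dispersion, assuming the common definition as the mean spacing between adjacent formant frequencies. The function name `formant_dispersion` and the formant values are hypothetical and chosen for illustration only; they do not come from the paper, and the paper may compute this measure differently.

```python
import numpy as np


def formant_dispersion(formants_hz):
    """Return the mean spacing (Hz) between adjacent formant frequencies.

    Assumes the common definition: the average of F[i+1] - F[i],
    which equals (F_N - F_1) / (N - 1) for N sorted formants.
    """
    f = np.sort(np.asarray(formants_hz, dtype=float))
    if f.size < 2:
        raise ValueError("need at least two formant frequencies")
    return float(np.mean(np.diff(f)))


# Hypothetical formant values in Hz, for illustration only (not from the paper).
example_formants = [500.0, 1500.0, 2500.0, 3500.0]
dispersion = formant_dispersion(example_formants)
print(dispersion)
```

A single summary number like this captures only the average formant spacing, which is one reason such hand-picked dimensions can struggle with naturalistic speech and within-speaker variability.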
Taxonomy
Topics: Speech Recognition and Synthesis
