Evaluating the reliability of acoustic speech embeddings
Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux

TL;DR
This paper systematically compares two metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages and 17 embedding methods, to evaluate speech embedding quality and to test how well each metric predicts downstream task performance.
Contribution
It provides a systematic comparison of two existing metrics for speech embeddings, ABX and MAP, and highlights their correlations and limitations across languages and methods.
Findings
ABX and MAP metrics generally correlate with each other.
Both metrics correlate with performance on the downstream task: unsupervised estimation of the frequencies of speech segments in a corpus.
Discrepancies exist in fine-grained distinctions across languages and methods.
Abstract
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency…
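To make the two metrics concrete, here is a minimal Python sketch of how ABX and MAP could be computed over a small set of labelled embeddings. It is an illustration under stated assumptions, not the evaluation protocol used in the paper: the function names (`cosine_distance_matrix`, `mean_average_precision`, `abx_error_rate`), the choice of cosine distance, the exhaustive triplet enumeration, and the tie handling are all assumptions introduced here.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X (assumed distance)."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - X @ X.T


def mean_average_precision(X, labels):
    """MAP: each embedding queries all others; items with the same label are relevant."""
    labels = np.asarray(labels)
    D = cosine_distance_matrix(X)
    average_precisions = []
    for i in range(len(labels)):
        # Rank every other item by increasing distance to the query.
        order = np.argsort(D[i], kind="stable")
        order = order[order != i]
        relevant = labels[order] == labels[i]
        if not relevant.any():
            continue  # a query with no same-label item is skipped
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(order) + 1)
        # Precision measured at the rank of each relevant item, then averaged.
        average_precisions.append((hits[relevant] / ranks[relevant]).mean())
    if not average_precisions:
        return float("nan")
    return float(np.mean(average_precisions))


def abx_error_rate(X, labels):
    """ABX: X shares A's label, B does not; an error occurs when X is closer to B."""
    labels = np.asarray(labels)
    D = cosine_distance_matrix(X)
    n = len(labels)
    errors, total = 0.0, 0
    for a in range(n):
        for x in range(n):
            if x == a or labels[x] != labels[a]:
                continue
            for b in range(n):
                if labels[b] == labels[a]:
                    continue
                total += 1
                if D[x, b] < D[x, a]:
                    errors += 1.0
                elif D[x, b] == D[x, a]:
                    errors += 0.5  # ties count as half an error (assumption)
    return errors / total if total else float("nan")


if __name__ == "__main__":
    # Toy embeddings: two well-separated categories (illustrative values only).
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    categories = ["a", "a", "b", "b"]
    print("MAP:", mean_average_precision(embeddings, categories))
    print("ABX error:", abx_error_rate(embeddings, categories))
```

In this sketch a higher MAP and a lower ABX error both indicate that embeddings of the same category sit closer together than embeddings of different categories. The triple loop is cubic in the number of items, so it is only suitable for small illustrative sets.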
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Topic Modeling
