Evaluating the reliability of acoustic speech embeddings

Robin Algayres; Mohamed Salah Zaiem; Benoit Sagot; Emmanuel Dupoux

arXiv:2007.13542 · eess.AS · November 9, 2020

TL;DR

This paper systematically compares two metrics for speech embedding quality, ABX discrimination and Mean Average Precision (MAP), on 5 languages and 17 embedding methods, then tests how well each metric predicts performance on a downstream task: unsupervised estimation of the frequencies of speech segments.

Contribution

It provides a comprehensive analysis of existing metrics for speech embeddings and highlights their correlations and limitations across languages and methods.

Findings

01. ABX and MAP metrics generally correlate with each other.

02. Both metrics correlate with frequency estimation performance.

03. Discrepancies exist in fine-grained distinctions across languages and methods.

Abstract

Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency…
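For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how an ABX error rate and a MAP score can be computed over a set of labelled embeddings. It is not the authors' evaluation code: the function names, the use of cosine distance, and the brute-force triplet enumeration are all assumptions made for illustration, and the paper's exact protocol (triplet sampling, speaker conditions, distance choice) is not reproduced here.

```python
import numpy as np


def cosine_distances(emb):
    """Pairwise cosine distance matrix for row-vector embeddings (assumed distance)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def abx_error(emb, labels):
    """Fraction of (A, B, X) triplets in which X is not closer to A than to B.

    A and X share a label, B has a different label. Ties count as half an error.
    Brute-force enumeration: illustrative only, cubic in the number of items.
    """
    dist = cosine_distances(np.asarray(emb, dtype=float))
    labels = np.asarray(labels)
    n = len(labels)
    errors, total = 0.0, 0
    for x in range(n):
        for a in range(n):
            if a == x or labels[a] != labels[x]:
                continue
            for b in range(n):
                if labels[b] == labels[x]:
                    continue
                total += 1
                if dist[a, x] > dist[b, x]:
                    errors += 1.0
                elif dist[a, x] == dist[b, x]:
                    errors += 0.5
    return errors / total if total else float("nan")


def mean_average_precision(emb, labels):
    """Mean over queries of the average precision of same-label items.

    Each embedding is used as a query; all other embeddings are ranked by
    distance to it, and items with the query's label count as relevant.
    Queries with no relevant item are skipped.
    """
    dist = cosine_distances(np.asarray(emb, dtype=float))
    labels = np.asarray(labels)
    ap_scores = []
    for q in range(len(labels)):
        others = np.delete(np.arange(len(labels)), q)
        order = others[np.argsort(dist[q, others], kind="stable")]
        relevant = (labels[order] == labels[q]).astype(float)
        if relevant.sum() == 0:
            continue
        precision_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
        ap_scores.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(ap_scores)) if ap_scores else float("nan")
```

On a toy set where same-label embeddings are clearly closer to each other than to any other item, `abx_error` returns 0.0 and `mean_average_precision` returns 1.0. Note the opposite polarity: ABX is reported here as an error rate (lower is better), while MAP is a precision (higher is better).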

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Topic Modeling