Analyzing the Representational Geometry of Acoustic Word Embeddings
Badr M. Abdullah, Dietrich Klakow

TL;DR
This paper investigates how different learning objectives and architectures influence the structure of acoustic word embeddings, using analytical techniques from machine learning and neuroscience to understand their representational geometry.
Contribution
The paper provides a detailed analysis of the factors shaping acoustic word embeddings and shows that the learning objective matters more than the choice of architecture.
Findings
Learning objectives significantly influence embedding geometry.
Model architecture has a lesser effect compared to learning objectives.
Analytic techniques reveal differences in embedding space uniformity and discriminability.
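To make "discriminability" and "uniformity" concrete, here is a minimal Python sketch on synthetic data. It is not the paper's method: the toy embeddings, the function names (`make_toy_embeddings`, `discriminability_gap`, `mean_pairwise_cosine`), and the two proxy measures are all assumptions chosen for illustration. The gap between same-word and different-word cosine similarity stands in for discriminability, and the mean pairwise cosine stands in as a rough indicator of how evenly the vectors spread over directions (values near 0 suggest a more uniform spread).

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_toy_embeddings(words, n_exemplars=5, dim=16, noise=0.1, seed=0):
    """Hypothetical stand-in for AWEs: each word gets a random centre,
    and each 'acoustic exemplar' is that centre plus small Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for word in words:
        centre = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(n_exemplars):
            data.append((word, [c + rng.gauss(0, noise) for c in centre]))
    return data


def discriminability_gap(data):
    """Mean same-word cosine minus mean different-word cosine.
    A larger gap means exemplars of a word cluster more tightly
    relative to other words (an illustrative proxy only)."""
    same, diff = [], []
    for (w1, e1), (w2, e2) in combinations(data, 2):
        (same if w1 == w2 else diff).append(cosine(e1, e2))
    return sum(same) / len(same) - sum(diff) / len(diff)


def mean_pairwise_cosine(data):
    """Average cosine over all pairs, ignoring word labels.
    Values far from 0 suggest the vectors occupy a narrow cone."""
    sims = [cosine(e1, e2) for (_, e1), (_, e2) in combinations(data, 2)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    toy = make_toy_embeddings(["cat", "dog", "house"])
    print("discriminability gap:", discriminability_gap(toy))
    print("mean pairwise cosine:", mean_pairwise_cosine(toy))
```

With real embeddings, the `(word, vector)` pairs would come from a trained AWE model instead of `make_toy_embeddings`, and the same two functions could be used to compare models trained with different objectives.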
Abstract
Acoustic word embeddings (AWEs) are vector representations such that different acoustic exemplars of the same word are projected nearby in the embedding space. In addition to their use in speech technology applications such as spoken term discovery and keyword spotting, AWE models have been adopted as models of spoken-word processing in several cognitively motivated studies and have been shown to exhibit human-like performance in some auditory processing tasks. Nevertheless, the representational geometry of AWEs remains under-explored in the literature. In this paper, we take a closer analytical look at AWEs learned from English speech and study how the choice of the learning objective and the architecture shapes their representational profile. To this end, we employ a set of analytic techniques from machine learning and neuroscience in three different…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
