Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models
Ramon Sanabria, Hao Tang, Sharon Goldwater

TL;DR
This paper investigates the use of pre-trained self-supervised speech models for creating acoustic word embeddings, demonstrating that simple pooling methods like averaging can produce competitive results across multiple languages.
Contribution
It is among the few studies to systematically compare pre-trained self-supervised models and pooling techniques for acoustic word embeddings, and it highlights that HuBERT remains effective on languages other than English.
Findings
HuBERT with mean pooling rivals the state of the art on English AWEs.
HuBERT, despite being trained only on English, outperforms both Wav2Vec 2.0 and the multilingual XLSR-53 on Xitsonga, Mandarin, and French.
Simple pooling methods are effective for constructing AWEs from self-supervised models.
Abstract
Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs with self-supervised representations. Owing to the contextualized nature of self-supervised representations, we hypothesize that simple pooling methods, such as averaging, might already be useful for constructing AWEs. When evaluating on a standard word discrimination task, we find that HuBERT representations with mean-pooling rival the state of the art on English AWEs. More surprisingly, despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual…
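To make the mean-pooling idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes each spoken word segment has already been turned into a sequence of frame-level vectors by some pre-trained model. The frames below are random stand-ins, and the helper names (`mean_pool`, `cosine_distance`, `fake_segment`) are hypothetical. Cosine distance is also an assumption, chosen as a common way to compare two embeddings in a word discrimination setting.

```python
import math
import random


def mean_pool(frames):
    """Average a list of frame vectors into one fixed-dimensional vector."""
    if not frames:
        raise ValueError("segment has no frames")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def fake_segment(num_frames, dim, rng):
    """Hypothetical stand-in for frame-level model outputs of one word segment."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


rng = random.Random(0)
DIM = 8  # illustrative only; real model features are much wider

# Two segments of different durations (different numbers of frames).
short_word = fake_segment(12, DIM, rng)
long_word = fake_segment(37, DIM, rng)

# Pooling maps both variable-length segments to vectors of the same size.
awe_short = mean_pool(short_word)
awe_long = mean_pool(long_word)

print(len(awe_short), len(awe_long))
print(cosine_distance(awe_short, awe_long))
```

Whatever the segment length, the pooled vector has the feature dimension of the underlying model, which is what makes segments of different durations directly comparable.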
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
