Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models

Ramon Sanabria; Hao Tang; Sharon Goldwater

arXiv:2210.16043 · cs.CL · March 16, 2023 · 1 citation

TL;DR

This paper investigates the use of pre-trained self-supervised speech models for creating acoustic word embeddings, demonstrating that simple pooling methods like averaging can produce competitive results across multiple languages.

Contribution

The paper compares several pre-trained self-supervised models and pooling methods for constructing acoustic word embeddings, a topic that has seen surprisingly few studies, and shows that HuBERT remains effective on languages other than English.

Findings

01 HuBERT with mean pooling rivals the state of the art on English AWEs.

02 Despite being trained only on English, HuBERT outperforms XLSR-53 and Wav2Vec 2.0 on Xitsonga, Mandarin, and French.

03 Simple pooling methods are effective for constructing AWEs from self-supervised models.

Abstract

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs with self-supervised representations. Owing to the contextualized nature of self-supervised representations, we hypothesize that simple pooling methods, such as averaging, might already be useful for constructing AWEs. When evaluating on a standard word discrimination task, we find that HuBERT representations with mean-pooling rival the state of the art on English AWEs. More surprisingly, despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual…
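To make the mean-pooling idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame-level features for each word segment are already available as a (frames, dimensions) array; here random values stand in for real model outputs, and the feature size of 768 is an assumption chosen for illustration. The names `mean_pool` and `cosine_distance` are hypothetical helpers, and comparing embeddings by cosine distance is likewise an assumption about how a word discrimination comparison might be set up.

```python
import numpy as np


def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Average frame-level features of shape (T, D) over time into one (D,) vector."""
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (frames, dims) array")
    return frames.mean(axis=0)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)


# Stand-in features: random values replace real self-supervised model outputs.
rng = np.random.default_rng(0)
dim = 768  # assumed feature dimension, for illustration only
word_a = rng.normal(size=(23, dim))  # a word segment spanning 23 frames
word_b = rng.normal(size=(41, dim))  # a longer segment spanning 41 frames

# Segments of different lengths map to vectors of the same fixed size.
awe_a = mean_pool(word_a)
awe_b = mean_pool(word_b)
distance = cosine_distance(awe_a, awe_b)
```

The point of the sketch is that segments of different durations end up as vectors of identical dimensionality, which is what makes them directly comparable.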

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing