Word Discovery in Visually Grounded, Self-Supervised Speech Models

Puyuan Peng; David Harwath

arXiv:2203.15081 · eess.AS · June 21, 2023

TL;DR

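This paper presents a method for visually grounded spoken term discovery: after a HuBERT or wav2vec2.0 model is trained to associate spoken captions with natural images, word segmentation and clustering ability emerges in its self-attention heads. The method performs on par with or better than currently published methods on several Buckeye and ZeroSpeech metrics.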

Contribution

It shows that training with visual grounding gives rise to word segmentation and clustering in a speech model's self-attention heads, an ability that is present to a much lesser extent in the base HuBERT and wav2vec2.0 models.

Findings

01

Word discovery ability is far stronger in the visually grounded models than in the base HuBERT and wav2vec2.0 models.

02

Powerful word segmentation and clustering capability emerges in the model's self-attention heads after visual grounding training.

03

Performs on par with or better than currently published methods on several metrics for Buckeye word segmentation and ZeroSpeech spoken term discovery.

Abstract

We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we perform on par with or better than currently published methods on several metrics. Code and model weights are available at https://github.com/jasonppy/word-discovery.
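The abstract does not spell out how word segments are read off the self-attention heads. As a minimal sketch, assuming one head's attention weights over speech frames are already available as a list, one plausible approach is to keep the most strongly attended frames and merge neighboring ones into spans. The function name `attention_segments`, the `keep_fraction` parameter, and the toy weights are illustrative assumptions, not the authors' implementation; the linked repository is the authoritative source.

```python
from typing import List, Tuple


def attention_segments(
    attn: List[float], keep_fraction: float = 0.3
) -> List[Tuple[int, int]]:
    """Return [start, end) frame spans covering the highest-attention frames.

    attn: one attention head's weights over speech frames (hypothetical input).
    keep_fraction: share of frames to keep; an illustrative knob, not a value
    taken from the paper.
    """
    if not attn:
        return []

    # Keep roughly the top `keep_fraction` of frames by attention weight.
    n_keep = max(1, round(len(attn) * keep_fraction))
    threshold = sorted(attn, reverse=True)[n_keep - 1]
    mask = [w >= threshold for w in attn]

    # Merge consecutive kept frames into contiguous spans.
    segments: List[Tuple[int, int]] = []
    start = None
    for i, keep in enumerate(mask):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(attn)))
    return segments


# Toy weights with two high-attention bursts (made-up numbers).
toy = [0.01, 0.02, 0.20, 0.25, 0.03, 0.02, 0.18, 0.22, 0.05, 0.02]
print(attention_segments(toy, keep_fraction=0.4))  # [(2, 4), (6, 8)]
```

In this toy case the two bursts of high attention become two candidate spans. Turning frame indices into timestamps would additionally require the model's frame rate, which is not stated here.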

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Balanced Selection