Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath

TL;DR
This paper presents a visually grounded, self-supervised speech model that learns to discover words by associating spoken captions with natural images. It performs on par with or better than published methods on several Buckeye and ZeroSpeech metrics.
Contribution
It demonstrates that visual grounding leads the self-attention heads of HuBERT and wav2vec2.0 models to segment and cluster words, a capability that is far weaker in the non-grounded base models.
Findings
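Visually grounded models show much stronger word discovery than the base HuBERT and wav2vec2.0 models.
Word segmentation and clustering capability emerges in the models' self-attention heads.
Results on Buckeye word segmentation and ZeroSpeech spoken term discovery are on par with or better than published methods on several metrics.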
Abstract
We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we perform on par with or better than currently published methods on several metrics. Code and model weights are available at https://github.com/jasonppy/word-discovery.
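To make the idea of segmentation emerging from self-attention concrete, here is a minimal sketch of how per-frame attention weights might be turned into candidate word spans. It assumes the weights for one attention head are already available as a list of floats, one per speech frame. The function name `segments_from_attention`, the `keep_fraction` parameter, and the thresholding rule are all hypothetical choices made for illustration; the abstract does not specify the authors' procedure, which is in the linked repository.

```python
from typing import List, Tuple


def segments_from_attention(
    weights: List[float], keep_fraction: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive high-attention frames into [start, end) spans.

    Assumption for illustration only: a frame counts as "attended" when its
    weight is at least keep_fraction times the maximum weight.
    """
    if not weights:
        return []
    threshold = keep_fraction * max(weights)
    spans: List[Tuple[int, int]] = []
    start = None  # index where the current span began, if any
    for i, w in enumerate(weights):
        if w >= threshold and start is None:
            start = i  # a new span opens at this frame
        elif w < threshold and start is not None:
            spans.append((start, i))  # the span closes before this frame
            start = None
    if start is not None:
        spans.append((start, len(weights)))  # span runs to the last frame
    return spans


# Toy attention weights over nine frames (made-up values, not model output).
attention = [0.01, 0.20, 0.25, 0.02, 0.01, 0.18, 0.22, 0.21, 0.03]
print(segments_from_attention(attention))  # [(1, 3), (5, 8)]
```

Each returned span is a pair of frame indices. Converting them to seconds would require the model's frame rate, which is not stated here.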
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Balanced Selection
