Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
William N. Havard, Jean-Pierre Chevrot, Laurent Besacier

TL;DR
This study investigates how a recurrent neural model of visually grounded speech implicitly segments its input into word-like units and maps them to visual referents, and what this reveals about how words are activated and represented in the model.
Contribution
The paper introduces the gating paradigm, a methodology borrowed from linguistics, to analyze neural representations. It shows that a word's representation is activated only when the network has access to the word's first phoneme, and that not all speech frames play an equal role in the encoding.
Findings
Model implicitly segments speech into word-like units
Word activation depends on first phoneme access
Certain speech frames are crucial for word representation
Abstract
In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description in a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned by neural networks -- the gating paradigm -- and show that the correct representation of a word is only activated if the network has access to the first phoneme of the target word, suggesting that the network does not rely on a global acoustic pattern. Furthermore, we find that not all speech frames (MFCC vectors in our case) play an equal role in the final encoded…
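The gating paradigm mentioned in the abstract can be illustrated with a minimal sketch: a word is presented in progressively longer truncated versions ("gates"), and each gate's encoding is compared with the encoding of the full word. Everything below is an assumption made for illustration and is not the authors' implementation: the names gated_prefixes, toy_encoder, and cosine are hypothetical, the mean-pooling encoder merely stands in for the recurrent speech encoder, and the random 13-dimensional vectors stand in for real MFCC frames.

```python
import math
import random


def gated_prefixes(frames, step=1):
    """Yield prefixes of `frames`, each `step` frames longer than the last.

    The final gate is always the complete word, even when the number of
    frames is not a multiple of `step`.
    """
    n = len(frames)
    for end in range(step, n + 1, step):
        yield frames[:end]
    if n % step:
        yield frames


def toy_encoder(frames):
    # Hypothetical stand-in for the trained speech encoder:
    # mean-pool the frames into one fixed-size vector.
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


random.seed(0)
# Synthetic "word": 10 frames of 13 coefficients (placeholder for MFCCs).
word = [[random.gauss(0.0, 1.0) for _ in range(13)] for _ in range(10)]

target = toy_encoder(word)                  # encoding of the full word
gates = list(gated_prefixes(word, step=3))  # gates of 3, 6, 9, 10 frames
scores = [cosine(toy_encoder(g), target) for g in gates]

for gate, score in zip(gates, scores):
    print(f"{len(gate):2d} frames -> similarity to full word: {score:.3f}")
```

With a real model, the toy encoder would be replaced by the trained speech encoder, and the similarity would be measured against the embedding of the word's visual referent or of the complete word. Applying the same slicing to suffixes (frames[start:]) would remove the word's onset instead of its ending.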
