From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
Lieke Gelderloos, Grzegorz Chrupała

TL;DR
This paper introduces a recurrent neural network model that learns to associate phoneme sequences with visual features, demonstrating hierarchical representation of linguistic information from form to meaning in a multimodal learning context.
Contribution
It presents a novel stacked gated recurrent neural network model that learns visually-grounded language from phoneme sequences, revealing hierarchical levels of linguistic representation.
Findings
The model successfully predicts visual features from phonetically transcribed image descriptions.
Lower layers in the stack are comparatively more sensitive to phonetic form.
Higher layers in the stack are comparatively more sensitive to meaning.
Abstract
We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
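To make the architecture in the abstract concrete, the following is a minimal sketch of a stacked gated recurrent network that maps a sequence of phoneme IDs to a predicted image feature vector. It is an illustration under stated assumptions, not the authors' implementation: the GRU-style gating equations, the phoneme embedding table, the use of the top layer's final state as the sentence representation, the cosine-distance objective, and every dimension and name (`init_model`, `predict_image_features`, `cosine_distance`) are hypothetical choices made for this example. Training is omitted; only the forward pass is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(rng, in_dim, hid_dim):
    """Random parameters for one GRU layer (update, reset, candidate)."""
    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))
    return {
        "Wz": mat(in_dim, hid_dim), "Uz": mat(hid_dim, hid_dim), "bz": np.zeros(hid_dim),
        "Wr": mat(in_dim, hid_dim), "Ur": mat(hid_dim, hid_dim), "br": np.zeros(hid_dim),
        "Wh": mat(in_dim, hid_dim), "Uh": mat(hid_dim, hid_dim), "bh": np.zeros(hid_dim),
    }

def gru_layer(p, xs):
    """Run one GRU layer over xs of shape (T, in_dim); return states of shape (T, hid_dim)."""
    h = np.zeros(p["bz"].shape[0])
    states = []
    for x in xs:
        z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])              # update gate
        r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])              # reset gate
        h_cand = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])   # candidate state
        h = (1.0 - z) * h + z * h_cand
        states.append(h)
    return np.stack(states)

def init_model(seed, n_phonemes, emb_dim, hid_dim, n_layers, img_dim):
    """Assumed layout: phoneme embeddings, a stack of GRU layers, a linear output map."""
    rng = np.random.default_rng(seed)
    return {
        "emb": rng.normal(0.0, 0.1, size=(n_phonemes, emb_dim)),
        "layers": [init_gru(rng, emb_dim if i == 0 else hid_dim, hid_dim)
                   for i in range(n_layers)],
        "out": rng.normal(0.0, 0.1, size=(hid_dim, img_dim)),
    }

def predict_image_features(model, phoneme_ids):
    """Return the predicted image feature vector and the per-layer hidden states."""
    xs = model["emb"][np.asarray(phoneme_ids)]
    per_layer = []
    for layer in model["layers"]:
        xs = gru_layer(layer, xs)      # each layer reads the states of the layer below
        per_layer.append(xs)
    pred = xs[-1] @ model["out"]       # final state of the top layer -> image feature space
    return pred, per_layer

def cosine_distance(a, b):
    """Assumed training objective: 1 minus the cosine similarity of prediction and target."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Returning the hidden states of every layer, rather than only the prediction, is a deliberate choice in this sketch: a comparison of how sensitive lower and higher layers are to form versus meaning needs access to each layer's activations separately.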
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Language, Metaphor, and Cognition
