TL;DR
This paper demonstrates that a simple bi-modal model trained on children's cartoon data can learn visual semantics of spoken language despite weak and confounded training signals, addressing ecological validity issues in language acquisition models.
Contribution
It introduces a novel dataset based on Peppa Pig and shows that models can learn visual semantics from weakly correlated speech-visual data.
Findings
The model learned visual semantics from weakly correlated speech-video data.
Training on dialogue from a children's cartoon is effective for language grounding.
The setup addresses ecological validity concerns in computational models of language acquisition.
Abstract
Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on…
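To make the idea of a joint vector space concrete, here is a minimal sketch of how speech and video embeddings can be compared and scored once both are mapped into the same space. It is an illustration, not the paper's implementation: the function names, the toy embeddings, the temperature value, and the choice of a softmax-based contrastive loss are all assumptions, since the summary above does not specify the training objective or the encoders.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def similarity_matrix(speech_embs, video_embs):
    """Cosine similarity between every utterance and every video clip."""
    s = [normalize(v) for v in speech_embs]
    v = [normalize(u) for u in video_embs]
    return [[sum(a * b for a, b in zip(si, vj)) for vj in v] for si in s]

def contrastive_loss(sims, temperature=0.1):
    """Mean cross-entropy of picking the matched clip (the diagonal entry)
    for each utterance. Hypothetical objective, assumed for illustration."""
    total = 0.0
    for i, row in enumerate(sims):
        logits = [x / temperature for x in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sims)

def recall_at_k(sims, k=1):
    """Fraction of utterances whose matched clip ranks in the top k."""
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sims)

# Toy embeddings (made up): row i of `speech` is paired with row i of `video`.
speech = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
video = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.0]]

sims = similarity_matrix(speech, video)
print("loss:", contrastive_loss(sims))
print("recall@1:", recall_at_k(sims, k=1))
```

In a real system the embeddings would come from trained speech and video encoders. The weak coupling described in the abstract means that many nominally matched pairs carry little shared semantic content, which is what makes this kind of objective harder to optimize on dialogue than on captioned images.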
