A Computational Acquisition Model for Multimodal Word Categorization
Uri Berger, Gabriel Stanovsky, Omri Abend, Lea Frermann

TL;DR
This paper introduces a cognitively-inspired multimodal model trained on naturalistic image-caption data and shows that it learns word categories and object recognition abilities, with trends reminiscent of those reported for child language development.
Contribution
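The study presents a cross-modal self-supervised model trained on naturalistic image-caption pairs. Unlike prior work, it does not rely on vision models trained on images annotated with a pre-defined set of object categories, so it can be evaluated on category learning tasks and compared with developmental findings.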
Findings
Model learns word categories effectively
Demonstrates object recognition abilities
Shows developmental trends similar to children
Abstract
Recent advances in self-supervised modeling of text and images open new opportunities for computational models of child language acquisition, which is believed to rely heavily on cross-modal signals. However, prior studies have been limited by their reliance on vision models trained on large image datasets annotated with a pre-defined set of depicted object categories. This is (a) not faithful to the information children receive and (b) prohibits the evaluation of such models with respect to category learning tasks, due to the pre-imposed category structure. We address this gap, and present a cognitively-inspired, multimodal acquisition model, trained from image-caption pairs on naturalistic data using cross-modal self-supervision. We show that the model learns word categories and object recognition abilities, and presents trends reminiscent of those reported in the developmental…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
