TL;DR
This paper presents a meta-learning framework that enables rapid and robust learning of word representations from visual scenes, leveraging compositional language structure and visual information, without pre-training.
Contribution
It introduces a novel meta-learning approach for language acquisition from visual scenes that outperforms baselines and is data-efficient, learning from scratch without pre-trained models.
Findings
Faster acquisition of novel words.
Improved generalization to unseen compositions.
Effective learning from minimal examples.
Abstract
Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach is able to more rapidly acquire novel words as well as more robustly generalize to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is that it is data efficient, allowing representations to be learned from scratch without language pre-training. Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples. Project webpage is available at…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
