Learning to Learn Words from Visual Scenes

Dídac Surís; Dave Epstein; Heng Ji; Shih-Fu Chang; Carl Vondrick

arXiv:1911.11237·cs.CL·July 14, 2020

TL;DR

This paper presents a meta-learning framework that learns word representations from visual scenes rapidly and robustly. It exploits the compositional structure of language together with visual information, and it requires no language pre-training.

Contribution

The paper introduces a meta-learning approach to language acquisition from visual scenes that outperforms established baselines on two datasets. The approach is data efficient: representations are learned from scratch, without language pre-training.

Findings

01 Faster acquisition of novel words.

02 Improved generalization to unseen compositions.

03 Effective learning from minimal examples.

Abstract

Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach is able to more rapidly acquire novel words as well as more robustly generalize to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is that it is data efficient, allowing representations to be learned from scratch without language pre-training. Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples. Project webpage is available at…
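The abstract's idea of building training episodes from the compositional structure of language can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the `make_episode` helper, the `[MASK]` token, the dictionary layout, and the choice of one matching reference per episode are hypothetical, and the sketch uses captions only, whereas the paper pairs language with visual scenes.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def make_episode(captions, target_idx, num_refs=2, rng=None):
    """Build one toy episode: a masked target caption plus reference captions.

    One word of the target caption is hidden. The reference set holds up to
    one other caption that contains the hidden word, padded with captions
    that do not, so a learner would have to find the word among the references.
    """
    rng = rng or random.Random(0)
    words = captions[target_idx].split()
    pos = rng.randrange(len(words))
    answer = words[pos]
    masked = words[:pos] + [MASK] + words[pos + 1:]

    # Split the remaining captions by whether they share the hidden word.
    others = [c for i, c in enumerate(captions) if i != target_idx]
    positives = [c for c in others if answer in c.split()]
    negatives = [c for c in others if answer not in c.split()]
    rng.shuffle(positives)
    rng.shuffle(negatives)

    references = positives[:1] + negatives[:max(num_refs - 1, 0)]
    rng.shuffle(references)
    return {
        "target": " ".join(masked),
        "answer": answer,
        "references": references,
    }


# Toy captions that recombine a few verbs and nouns.
captions = ["cut the onion", "peel the onion", "cut the carrot", "wash the pan"]
episode = make_episode(captions, target_idx=0, num_refs=2, rng=random.Random(0))
print(episode["target"], "| references:", episode["references"])
```

Because words recur across captions in different combinations, each caption can be turned into an episode whose answer is recoverable from other captions. This is only a rough picture of how compositionality supplies supervision for a meta-learner.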

Code & Models

Repositories

cvlab-columbia/expert (PyTorch)
