Learning to generalize to new compositions in image understanding

Yuval Atzmon; Jonathan Berant; Vahid Kezami; Amir Globerson; Gal Chechik

arXiv:1608.07639·cs.CV·August 30, 2016·54 cites

Open Access

TL;DR

This paper compares structured representations and recurrent neural networks for image captioning, demonstrating that structured models generalize better to new scene compositions and proposing a benchmark for compositional generalization.

Contribution

The paper introduces short structured representations for describing images, shows that a structured model generalizes to new compositions better than an LSTM-based model, and advocates compositional benchmarks.

Findings

01 Structured models generalize to new compositions far better than LSTMs (about 7x the accuracy).

02 Structured representations make it possible to measure generalization to unseen combinations separately from generalization to new images.

03 Compositional splits are proposed as a benchmark for image captioning.

Abstract

Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture the crux of a description. These structured representations allow us to tease out and evaluate separately two types of generalization: standard generalization to new images with similar scenes, and generalization to new combinations of known entities. We compare two learning approaches on the MS-COCO dataset: a state-of-the-art recurrent network based on an LSTM (Show, Attend and Tell), and a simple structured prediction model on top of a deep network. We find that the structured model…
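
The second kind of generalization in the abstract, new combinations of known entities, can be illustrated with a toy compositional split. The sketch below is not the authors' procedure. It assumes each description has already been reduced to a (subject, relation, object) triplet, and the data, the function name `compositional_split`, and the parameter `max_test_triplets` are all hypothetical. It greedily holds out whole triplets for testing while making sure every word of a held-out triplet still occurs in some training triplet.

```python
from collections import Counter

# Hypothetical toy data: (image_id, (subject, relation, object)).
SAMPLES = [
    ("img1", ("man", "ride", "horse")),
    ("img2", ("man", "hold", "bat")),
    ("img3", ("dog", "hold", "frisbee")),
    ("img4", ("dog", "ride", "skateboard")),
    ("img5", ("man", "ride", "skateboard")),
    ("img6", ("dog", "hold", "bat")),
]


def compositional_split(samples, max_test_triplets):
    """Split samples so that no triplet is shared between train and test,
    while every word in a test triplet still appears in a train triplet."""
    triplets = sorted({t for _, t in samples})  # sorted for determinism

    # Number of distinct *training* triplets that contain each word.
    word_count = Counter(w for t in triplets for w in set(t))

    test_triplets = set()
    for t in triplets:
        if len(test_triplets) >= max_test_triplets:
            break
        # Hold out t only if each of its words remains in another train triplet.
        if all(word_count[w] >= 2 for w in set(t)):
            test_triplets.add(t)
            for w in set(t):
                word_count[w] -= 1

    train = [s for s in samples if s[1] not in test_triplets]
    test = [s for s in samples if s[1] in test_triplets]
    return train, test


train, test = compositional_split(SAMPLES, max_test_triplets=2)
print("train:", train)
print("test:", test)
```

A standard (non-compositional) split would instead divide images at random, so the same triplet could appear on both sides. Comparing a model's accuracy under the two splits is one way to separate the two kinds of generalization.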

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory