Learning to generalize to new compositions in image understanding
Yuval Atzmon, Jonathan Berant, Vahid Kezami, Amir Globerson, Gal Chechik

TL;DR
This paper compares structured representations and recurrent neural networks for image captioning, demonstrating that structured models generalize better to new scene compositions and proposing a benchmark for compositional generalization.
Contribution
It introduces structured representations for image captioning, showing improved generalization to new compositions over LSTM-based models, and advocates for compositional benchmarks.
Findings
Structured models outperform LSTMs at compositional generalization (roughly 7x the accuracy)
Structured representations enable quantification of generalization to unseen combinations
Proposes compositional splits as a benchmark for image captioning
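A compositional split of the kind listed above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes each image is annotated with a (subject, relation, object) triplet, a form this summary does not spell out, and the function name `compositional_split`, the `max_test_triplets` parameter, and the toy data are all hypothetical. The idea is that every test triplet is an unseen combination, while each of its parts still appears somewhere in training.

```python
from collections import Counter, defaultdict


def compositional_split(samples, max_test_triplets):
    """Split (image_id, triplet) pairs so test triplets never occur in train,
    yet every word of a test triplet occurs in train in the same slot.

    Assumption (illustrative only): a triplet is (subject, relation, object).
    """
    by_triplet = defaultdict(list)
    for image_id, triplet in samples:
        by_triplet[triplet].append(image_id)

    # Number of distinct training triplets containing each (slot, word) pair.
    train_counts = Counter()
    for triplet in by_triplet:
        for part in enumerate(triplet):
            train_counts[part] += 1

    test_triplets = set()
    for triplet in sorted(by_triplet):  # sorted for a deterministic result
        if len(test_triplets) >= max_test_triplets:
            break
        parts = list(enumerate(triplet))
        # Hold out the triplet only if each part survives in another
        # training triplet, so the entities remain "known".
        if all(train_counts[p] > 1 for p in parts):
            test_triplets.add(triplet)
            for p in parts:
                train_counts[p] -= 1

    train = [(i, t) for i, t in samples if t not in test_triplets]
    test = [(i, t) for i, t in samples if t in test_triplets]
    return train, test


# Toy annotations, invented for this example.
samples = [
    (1, ("man", "ride", "horse")),
    (2, ("man", "ride", "bike")),
    (3, ("woman", "ride", "horse")),
    (4, ("woman", "hold", "bike")),
    (5, ("man", "hold", "horse")),
    (6, ("man", "ride", "horse")),
]
train, test = compositional_split(samples, max_test_triplets=2)
```

Scoring a model separately on such a split and on an ordinary random split is one way to tell generalization to new images apart from generalization to new combinations of known entities.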
Abstract
Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture the crux of a description. These structured representations allow us to tease out and evaluate separately two types of generalization: standard generalization to new images with similar scenes, and generalization to new combinations of known entities. We compare two learning approaches on the MS-COCO dataset: a state-of-the-art recurrent network based on an LSTM (Show, Attend and Tell), and a simple structured prediction model on top of a deep network. We find that the structured model…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
