The Role of Syntactic Planning in Compositional Image Captioning

Emanuele Bugliarello; Desmond Elliott

arXiv:2101.11911 · cs.CL · January 29, 2021

TL;DR

This paper explores how planning the syntactic structure of an image caption helps models describe unseen adjective-noun and noun-verb compositions, while also improving scores on standard captioning metrics.

Contribution

The paper investigates methods that jointly model tokens and syntactic tags to improve compositional generalization in both RNN- and Transformer-based image captioning models.
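As a rough illustration of what "jointly modeling tokens and syntactic tags" can look like, the sketch below interleaves each tag with its token so that a single decoder would predict both in one sequence. This is an assumption made for illustration, not necessarily the authors' formulation; the function names, the `<TAG>` marker format, and the coarse tag set are all hypothetical.

```python
from typing import List, Tuple


def interleave(tokens: List[str], tags: List[str]) -> List[str]:
    """Build one target sequence: <TAG_1> tok_1 <TAG_2> tok_2 ...

    Hypothetical helper: emitting the tag first means the decoder
    commits to a syntactic category before choosing the word.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    sequence: List[str] = []
    for tag, token in zip(tags, tokens):
        sequence.append(f"<{tag}>")  # planned syntactic category
        sequence.append(token)       # word realizing that category
    return sequence


def split_interleaved(sequence: List[str]) -> Tuple[List[str], List[str]]:
    """Recover (tokens, tags) from an interleaved sequence."""
    if len(sequence) % 2 != 0:
        raise ValueError("interleaved sequence must have even length")
    tags = [item[1:-1] for item in sequence[0::2]]  # strip the < > markers
    tokens = sequence[1::2]
    return tokens, tags


caption_tokens = ["a", "white", "dog", "runs"]
caption_tags = ["DET", "ADJ", "NOUN", "VERB"]  # illustrative tag set
target = interleave(caption_tokens, caption_tags)
```

At inference time, the tags would be stripped with `split_interleaved` before scoring the caption, so evaluation sees only the words.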

Findings

01 Joint modeling of tokens and syntactic tags improves compositional generalization.

02 Syntactic planning also improves performance on standard captioning metrics.

03 The gains hold for both RNN- and Transformer-based models on the compositional generalization dataset of Nikolaus et al. (2019).

Abstract

Image captioning has focused on generalizing to images drawn from the same distribution as the training set, and not to the more challenging problem of generalizing to different distributions of images. Recently, Nikolaus et al. (2019) introduced a dataset to assess compositional generalization in image captioning, where models are evaluated on their ability to describe images with unseen adjective-noun and noun-verb compositions. In this work, we investigate different methods to improve compositional generalization by planning the syntactic structure of a caption. Our experiments show that jointly modeling tokens and syntactic tags enhances generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics.
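The evaluation setup described above, where some concept pairs never appear together in training, can be sketched as follows. This is a deliberately simplified stand-in: it treats a pair as present when both words occur anywhere in the caption, whereas a real benchmark would need proper linguistic matching. The function names and the example captions are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def has_pair(caption: str, pair: Pair) -> bool:
    """True if both words of the pair occur in the caption (simplified check)."""
    words = caption.lower().split()
    return pair[0] in words and pair[1] in words


def split_by_heldout(
    captions: Iterable[str], heldout_pairs: List[Pair]
) -> Tuple[List[str], List[str]]:
    """Send captions containing any held-out pair to the evaluation set.

    The training set still contains each word of a pair on its own,
    just never the two together.
    """
    train: List[str] = []
    heldout: List[str] = []
    for caption in captions:
        if any(has_pair(caption, pair) for pair in heldout_pairs):
            heldout.append(caption)
        else:
            train.append(caption)
    return train, heldout


captions = ["a white dog runs", "a black dog sits", "a white car"]
train_set, eval_set = split_by_heldout(captions, [("white", "dog")])
```

Here the model still sees "white" and "dog" separately during training, so describing a white dog at test time requires composing known concepts rather than recalling a seen phrase.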

Code & Models

Repositories

e-bug/syncap (PyTorch, official)