TL;DR
This paper investigates the ability of image captioning models to generalize to unseen concept combinations, revealing poor performance of current models and proposing a multi-task approach that significantly improves compositional generalization.
Contribution
The paper introduces a multi-task model combining captioning and ranking with a re-ranking decoding mechanism to enhance compositional generalization in image captioning.
Findings
The proposed model outperforms state-of-the-art models on unseen concept combinations.
Current models show poor generalization to novel concept compositions.
Multi-task training improves the ability to generalize in image captioning.
Abstract
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. We propose a multi-task model to address the poor performance, that combines caption generation and image--sentence ranking, and uses a decoding mechanism that re-ranks the captions according their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
