A Neural Compositional Paradigm for Image Captioning
Bo Dai, Sanja Fidler, Dahua Lin

TL;DR
This paper introduces a two-stage, compositional approach to image captioning that explicitly separates semantic extraction from caption construction, yielding captions that better preserve semantic content, are more diverse, and generalize better.
Contribution
It proposes a novel paradigm that factorizes captioning into semantic extraction and recursive compositional caption construction, improving diversity and generalization over traditional sequential models.
Findings
Better preservation of semantic content
Requires less training data
Produces more diverse captions
Abstract
Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional ones, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. By using the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires…
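The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the phrase list, the `CONNECTORS` lookup table, and the function names `extract_phrases`, `connect`, and `compose` are all hypothetical stand-ins, and the hand-written table replaces whatever learned components the paper uses. It only shows the general shape of a bottom-up procedure that repeatedly merges smaller phrases into larger ones.

```python
from itertools import permutations

# Stage 1 (hypothetical stand-in): return phrases that play the role of the
# explicit semantic representation extracted from an image.
def extract_phrases(image_id):
    toy_semantics = {"img1": ["a dog", "a frisbee", "the grass"]}
    return list(toy_semantics[image_id])

# Hypothetical lookup table replacing a learned model: for an ordered pair of
# phrases it gives a connecting word and a plausibility score.
CONNECTORS = {
    ("a dog", "a frisbee"): ("catching", 0.9),
    ("a dog", "the grass"): ("on", 0.6),
    ("a dog catching a frisbee", "the grass"): ("on", 0.8),
}

def connect(left, right):
    """Join two phrases with a connector, or return (None, 0.0) if none is known."""
    entry = CONNECTORS.get((left, right))
    if entry is None:
        return None, 0.0
    connector, score = entry
    return f"{left} {connector} {right}", score

# Stage 2 (toy version): bottom-up composition. At each step, merge the
# best-scoring ordered pair in the pool and put the merged phrase back.
def compose(phrases):
    pool = list(phrases)
    while len(pool) > 1:
        best = None
        for i, j in permutations(range(len(pool)), 2):
            merged, score = connect(pool[i], pool[j])
            if merged is not None and (best is None or score > best[0]):
                best = (score, i, j, merged)
        if best is None:  # nothing left that can be connected
            break
        _, i, j, merged = best
        pool = [p for k, p in enumerate(pool) if k not in (i, j)] + [merged]
    # Assumption for this sketch: the longest remaining phrase is the caption.
    return max(pool, key=len)

caption = compose(extract_phrases("img1"))
print(caption)
```

Because the semantics (the phrase pool) are fixed before composition starts, the construction step can only rearrange and connect what was extracted, which is one way to read the abstract's claim about factorizing semantics and syntax.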
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
