A Neural Compositional Paradigm for Image Captioning

Bo Dai; Sanja Fidler; Dahua Lin

arXiv:1810.09630·cs.CV·October 24, 2018·24 cites

TL;DR

This paper introduces a two-stage, compositional approach to image captioning that explicitly separates semantic extraction from caption construction, yielding captions that better preserve image semantics, are more diverse, and generalize better.

Contribution

It proposes a novel paradigm that factorizes captioning into semantic extraction and recursive compositional caption construction, improving diversity and generalization over traditional sequential models.

Findings

01 Better preservation of semantic content

02 Requires less training data

03 Produces more diverse captions

Abstract

Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional ones, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. By using the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires…
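
The two-stage procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`extract_phrases`, `toy_score`, `compose_caption`), the fixed connector list, the hand-written scoring rule, and the stopping rule (merge until one phrase remains) are all assumptions made for illustration. In a real system both stages would presumably be learned models.

```python
from itertools import permutations
from typing import Callable, List

Scorer = Callable[[str, str, str], float]


def extract_phrases(image_id: str) -> List[str]:
    """Stage 1 stand-in (hypothetical): return an explicit semantic
    representation of the image as a list of short phrases.
    Here it is hard-coded instead of being predicted from pixels."""
    return ["a dog", "a frisbee", "a field"]


# Hypothetical preferences standing in for a learned connecting score.
PREFERRED = {("with", "a frisbee"): 2.0, ("on", "a field"): 1.0}


def toy_score(left: str, connector: str, right: str) -> float:
    """Toy score for joining `left` and `right` with `connector`."""
    bonus = 1.0 if left.startswith("a dog") else 0.0  # keep the subject first
    return PREFERRED.get((connector, right), 0.0) + bonus


def compose_caption(phrases: List[str], connectors: List[str], score: Scorer) -> str:
    """Stage 2 stand-in: bottom-up composition. Repeatedly pick the
    best-scoring ordered pair of phrases plus a connector, and replace
    the pair with the merged phrase, until one phrase is left."""
    pool = list(phrases)
    if not pool:
        raise ValueError("need at least one phrase")
    while len(pool) > 1:
        # Enumerate every ordered pair (i, j) and every connector.
        _, i, j, c = max(
            ((score(pool[i], c, pool[j]), i, j, c)
             for i, j in permutations(range(len(pool)), 2)
             for c in connectors),
            key=lambda t: t[0],
        )
        merged = f"{pool[i]} {c} {pool[j]}"
        pool = [p for k, p in enumerate(pool) if k not in (i, j)] + [merged]
    return pool[0]


caption = compose_caption(extract_phrases("example.jpg"), ["with", "on"], toy_score)
print(caption)
```

The point of the sketch is the control flow: semantics are fixed in stage 1, and stage 2 only decides how to combine them, so the caption is built as a tree of merges instead of one word at a time from left to right.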

Code & Models

Repositories

ajaysub110/A-Neural-Compositional-Paradigm-for-Image-Captioning (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques