Injecting Prior Knowledge into Image Caption Generation

Arushi Goel, Basura Fernando, Thanh-Son Nguyen, and Hakan Bilen

arXiv:1911.10082·cs.CL·August 7, 2020

TL;DR

This paper introduces a novel approach to enhance image captioning by integrating prior knowledge through latent topics and regularization, resulting in more accurate and human-like captions, especially in low-data scenarios.

Contribution

The paper proposes a method that combines conditional latent topic attention with a regularization technique to improve image captioning performance and interpretability.

Findings

01 Significant improvement on the MSCOCO dataset

02 Enhanced caption interpretability

03 Better performance in low-data regimes

Abstract

Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. The state-of-the-art methods in image captioning struggle to approach human-level performance, especially when data is limited. In this paper, we propose to improve the performance of the state-of-the-art image captioning models by incorporating two sources of prior knowledge: (i) a conditional latent topic attention that uses a set of latent variables (topics) as an anchor to generate highly probable words, and (ii) a regularization technique that exploits the inductive biases in the syntactic and semantic structure of captions and improves the generalization of image captioning models. Our experiments validate that our method produces more human interpretable…
