Rethinking the Form of Latent States in Image Captioning

Bo Dai, Deming Ye, and Dahua Lin

arXiv:1807.09958·cs.CV·August 15, 2018

Open Access

TL;DR

This paper proposes using two-dimensional maps for latent states in image captioning models, which improves performance and preserves spatial locality, offering new insights into caption generation dynamics.

Contribution

The paper introduces a 2D latent-state formulation for image captioning and demonstrates its effectiveness over conventional vector-based states.

Findings

01. 2D states outperform vector states in captioning accuracy

02. 2D states maintain spatial locality in latent representations

03. Visual analysis reveals internal caption generation dynamics
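
The spatial-locality finding can be illustrated with a toy comparison. This page does not give the paper's update rule, so the sketch below rests on assumptions: a single-channel 3x3 convolution with zero padding stands in for a 2D-state transition, and a dense matrix stands in for a vector-state transition. The names `dense_step` and `conv_step`, the 5x5 grid, and the weights are all hypothetical and chosen only for illustration.

```python
import math

def dense_step(h, x, W):
    """Vector state (assumed form): every output unit sees every input unit."""
    n = len(h)
    return [math.tanh(sum(W[i][j] * h[j] for j in range(n)) + x[i])
            for i in range(n)]

def conv_step(H, X, K):
    """2D state (assumed form): each cell sees only its 3x3 neighborhood."""
    rows, cols = len(H), len(H[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = X[r][c]
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Zero padding: skip neighbors outside the map.
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += K[dr + 1][dc + 1] * H[rr][cc]
            out[r][c] = math.tanh(acc)
    return out

SIZE = 5
zeros = [[0.0] * SIZE for _ in range(SIZE)]
K = [[0.5] * 3 for _ in range(3)]          # illustrative kernel weights

# Perturb a single cell of the 2D state and see which outputs move.
bumped = [row[:] for row in zeros]
bumped[2][2] = 1.0
base_2d = conv_step(zeros, zeros, K)
pert_2d = conv_step(bumped, zeros, K)
changed_2d = {(r, c) for r in range(SIZE) for c in range(SIZE)
              if base_2d[r][c] != pert_2d[r][c]}

# Same experiment on a flattened vector state with a dense transition.
n = SIZE * SIZE
W = [[0.1] * n for _ in range(n)]          # illustrative dense weights
h0 = [0.0] * n
h1 = h0[:]
h1[12] = 1.0                               # same position as cell (2, 2)
x0 = [0.0] * n
base_vec = dense_step(h0, x0, W)
pert_vec = dense_step(h1, x0, W)
changed_vec = [i for i in range(n) if base_vec[i] != pert_vec[i]]

print(len(changed_2d), len(changed_vec))
```

In this toy setting the perturbation stays inside the 3x3 neighborhood of the touched cell for the convolutional update, while the dense update spreads it to every unit. The convolutional transition also uses 9 weights here against 625 for the dense one, though that ratio is an artifact of the toy sizes and says nothing about the parameter counts reported in the paper.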

Abstract

RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by a question: how do the spatial structures in the latent states affect the resultant captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques