What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

Marc Tanti; Albert Gatt; Kenneth P. Camilleri

arXiv:1708.02043·cs.CL·August 28, 2017

TL;DR

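This paper compares two architectures for neural image captioning, injecting image features into the RNN versus merging them with the RNN's output at a later stage, and finds that using the RNN as an encoder of the previously generated words rather than as the generator generally produces better captions.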

Contribution

The paper challenges the dominant view by showing empirically that late merging of image and linguistic features generally outperforms injecting image features into the RNN.

Findings

01

The late-merging architecture generally outperforms the injection approach.

02

RNNs are better viewed as encoders of linguistic features than as generators.

03

The empirical evidence supports rethinking the role of the RNN in caption generators.

Abstract

In neural image captioning systems, a recurrent neural network (RNN) is typically viewed as the primary 'generation' component. This view suggests that the image features should be 'injected' into the RNN. This is in fact the dominant view in the literature. Alternatively, the RNN can instead be viewed as only encoding the previously generated words. This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be 'merged' with the image features at a later stage. This paper compares these two architectures. We find that, in general, late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.
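The difference between the two views can be made concrete with a small sketch. The NumPy code below is illustrative only and is not the authors' implementation: the plain tanh RNN, the use of concatenation for both injection and merging, the layer sizes, and all names (`make_params`, `inject_next_word`, `merge_next_word`) are assumptions chosen for brevity. In the inject variant the image vector is part of the RNN's input at every step; in the merge variant the RNN sees only word embeddings, and the image vector joins its final state just before the output layer.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
D_IMG, D_EMB, D_HID, VOCAB = 4, 3, 5, 6

def make_params(seed=0):
    """Random weights for both variants (untrained; shapes are what matter here)."""
    rng = np.random.default_rng(seed)
    return {
        "W_h": rng.normal(scale=0.1, size=(D_HID, D_HID)),
        "W_x_inject": rng.normal(scale=0.1, size=(D_EMB + D_IMG, D_HID)),
        "W_x_merge": rng.normal(scale=0.1, size=(D_EMB, D_HID)),
        "W_out_inject": rng.normal(scale=0.1, size=(D_HID, VOCAB)),
        "W_out_merge": rng.normal(scale=0.1, size=(D_HID + D_IMG, VOCAB)),
    }

def run_rnn(inputs, W_x, W_h):
    """Plain tanh RNN over inputs of shape (T, d_in); returns the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def inject_next_word(image_vec, word_embs, params):
    """Inject: the image is fed into the RNN (here, concatenated to every word embedding)."""
    T = word_embs.shape[0]
    inputs = np.concatenate([word_embs, np.tile(image_vec, (T, 1))], axis=1)
    h = run_rnn(inputs, params["W_x_inject"], params["W_h"])
    return softmax(h @ params["W_out_inject"])

def encode_prefix(word_embs, params):
    """Merge, step 1: the RNN encodes the caption prefix only; no image is involved."""
    return run_rnn(word_embs, params["W_x_merge"], params["W_h"])

def merge_next_word(image_vec, word_embs, params):
    """Merge, step 2: the final RNN state is combined with the image after the RNN."""
    merged = np.concatenate([encode_prefix(word_embs, params), image_vec])
    return softmax(merged @ params["W_out_merge"])

params = make_params()
rng = np.random.default_rng(1)
image_vec = rng.normal(size=D_IMG)        # stand-in for CNN image features
word_embs = rng.normal(size=(3, D_EMB))   # stand-in for an embedded 3-word prefix
p_inject = inject_next_word(image_vec, word_embs, params)
p_merge = merge_next_word(image_vec, word_embs, params)
```

Both functions return a probability distribution over the next word. The structural point is that `encode_prefix` never receives the image, so in the merge variant the RNN's hidden state is a purely linguistic representation.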
