TL;DR
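This paper compares two architectures for neural image captioning. It finds that using the RNN only to encode the words generated so far, and merging that encoding with the image features afterwards, generally performs better than using the RNN as a generator into which image features are injected.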
Contribution
The paper challenges the dominant view in the literature by showing empirically that late merging of image and linguistic features generally outperforms injecting image features into the RNN.
Findings
The late-merging architecture generally outperforms the injection approach.
RNNs are more effective as encoders of linguistic features than as caption generators.
The empirical evidence supports rethinking the RNN's role in caption generators.
Abstract
In neural image captioning systems, a recurrent neural network (RNN) is typically viewed as the primary 'generation' component. This view suggests that the image features should be 'injected' into the RNN. This is in fact the dominant view in the literature. Alternatively, the RNN can instead be viewed as only encoding the previously generated words. This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be 'merged' with the image features at a later stage. This paper compares these two architectures. We find that, in general, late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.
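The difference between the two views in the abstract above can be sketched in a few lines of NumPy. This is an illustration only, not the authors' implementation: the class names, layer sizes, plain tanh RNN cell, concatenation-based merge, and untrained random weights are all assumptions made for the example, and the injection shown (image features concatenated with every word embedding) is just one possible way to inject.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
VOCAB, EMBED, HIDDEN, IMG = 50, 16, 32, 64


def init(shape):
    """Small random weights; a trained model would learn these."""
    return rng.normal(scale=0.1, size=shape)


def rnn_step(x, h, W_x, W_h, b):
    """One step of a simple tanh RNN (assumed cell type)."""
    return np.tanh(x @ W_x + h @ W_h + b)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class InjectCaptioner:
    """'Inject' view: image features are fed into the RNN itself."""

    def __init__(self):
        self.E = init((VOCAB, EMBED))        # word embeddings
        self.W_img = init((IMG, EMBED))      # image projection
        self.W_x = init((2 * EMBED, HIDDEN))
        self.W_h = init((HIDDEN, HIDDEN))
        self.b = np.zeros(HIDDEN)
        self.W_out = init((HIDDEN, VOCAB))

    def next_word_probs(self, image, prefix):
        img = image @ self.W_img
        h = np.zeros(HIDDEN)
        for w in prefix:
            # The RNN sees the image at every step, mixed with the word.
            x = np.concatenate([self.E[w], img])
            h = rnn_step(x, h, self.W_x, self.W_h, self.b)
        return softmax(h @ self.W_out)


class MergeCaptioner:
    """'Merge' view: the RNN encodes words only; the image joins later."""

    def __init__(self):
        self.E = init((VOCAB, EMBED))
        self.W_x = init((EMBED, HIDDEN))
        self.W_h = init((HIDDEN, HIDDEN))
        self.b = np.zeros(HIDDEN)
        self.W_img = init((IMG, HIDDEN))
        self.W_out = init((2 * HIDDEN, VOCAB))

    def encode_prefix(self, prefix):
        # Purely linguistic encoding: no image argument at all.
        h = np.zeros(HIDDEN)
        for w in prefix:
            h = rnn_step(self.E[w], h, self.W_x, self.W_h, self.b)
        return h

    def next_word_probs(self, image, prefix):
        h = self.encode_prefix(prefix)
        # Late merge: combine the final RNN state with the image features.
        merged = np.concatenate([h, image @ self.W_img])
        return softmax(merged @ self.W_out)
```

Both classes expose the same `next_word_probs(image, prefix)` call, where `image` is a feature vector of length `IMG` and `prefix` is a list of word indices (in practice starting with a start token). The only structural difference is where the image enters: inside the recurrence for the inject variant, or after the recurrence for the merge variant.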
