On Architectures for Including Visual Information in Neural Language Models for Image Description
Marc Tanti, Albert Gatt, Kenneth P. Camilleri

TL;DR
This paper compares four neural architectures (init-inject, pre-inject, par-inject, and merge) for integrating visual information into language models for image captioning. It finds that init-inject performs best overall, while merge maintains visual influence longer and benefits from transfer learning.
Contribution
The paper identifies and analyses four main architectures for integrating visual information into neural language models, determines which one performs best, and explores the benefits of transfer learning.
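To make the four entry points concrete, here is a minimal NumPy sketch of a toy recurrent language model. It assumes the definitions commonly given for these names: init-inject sets the RNN's initial state from the image, pre-inject feeds the image as if it were the first word, par-inject feeds the image alongside every word, and merge combines the image with the RNN state just before the output layer. All sizes, weights, and function names (`run_rnn`, `next_word_probs`) are hypothetical and untrained; the sketch only shows where the image enters and is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, HIDDEN, IMG = 12, 8, 8, 16  # toy sizes (assumed for illustration)

def init(*shape):
    # Random, untrained toy parameters.
    return rng.normal(scale=0.5, size=shape)

E = init(VOCAB, EMBED)               # word embeddings
W_x = init(EMBED, HIDDEN)            # word input -> hidden
W_par = init(EMBED + IMG, HIDDEN)    # (word + image) input -> hidden, par-inject only
W_h = init(HIDDEN, HIDDEN)           # hidden -> hidden
P_h = init(IMG, HIDDEN)              # image -> initial state, init-inject only
P_e = init(IMG, EMBED)               # image -> pseudo-word, pre-inject only
W_out = init(HIDDEN, VOCAB)          # hidden -> vocabulary logits
W_merge = init(HIDDEN + IMG, VOCAB)  # (hidden + image) -> logits, merge only

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_rnn(inputs, h0, W_in):
    # Simple tanh RNN; returns the final hidden state.
    h = h0
    for x in inputs:
        h = np.tanh(x @ W_in + h @ W_h)
    return h

def next_word_probs(arch, image, prefix):
    """Next-word distribution for a prefix of word ids under one architecture."""
    words = [E[i] for i in prefix]
    zeros = np.zeros(HIDDEN)
    if arch == "init-inject":
        # Image sets the initial hidden state.
        h = run_rnn(words, np.tanh(image @ P_h), W_x)
    elif arch == "pre-inject":
        # Image is fed as the first "word" of the sequence.
        h = run_rnn([image @ P_e] + words, zeros, W_x)
    elif arch == "par-inject":
        # Image is concatenated to every word input.
        h = run_rnn([np.concatenate([w, image]) for w in words], zeros, W_par)
    elif arch == "merge":
        # RNN sees words only; image joins just before the output layer.
        h = run_rnn(words, zeros, W_x)
        return softmax(np.concatenate([h, image]) @ W_merge)
    else:
        raise ValueError(f"unknown architecture: {arch}")
    return softmax(h @ W_out)

ARCHS = ["init-inject", "pre-inject", "par-inject", "merge"]
image = rng.normal(size=IMG)
prefix = [1, 4, 7]
for arch in ARCHS:
    probs = next_word_probs(arch, image, prefix)
    print(arch, probs.shape, round(float(probs.sum()), 6))
```

In the merge branch the recurrent part never sees the image, so in this sketch its weights could in principle be trained on text alone, which is what makes the transfer-learning idea natural for that design.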
Findings
The init-inject architecture performs best according to both automatic evaluation measures and human annotation.
The merge architecture retains visual influence longer.
Transfer learning improves performance for the merge architecture.
Abstract
A neural language model can be conditioned into generating descriptions for images by providing visual information apart from the sentence prefix. This visual information can be included into the language model through different points of entry resulting in different neural architectures. We identify four main architectures which we call init-inject, pre-inject, par-inject, and merge. We analyse these four architectures and conclude that the best performing one is init-inject, which is when the visual information is injected into the initial state of the recurrent neural network. We confirm this using both automatic evaluation measures and human annotation. We then analyse how much influence the images have on each architecture. This is done by measuring how different the output probabilities of a model are when a partial sentence is combined with a completely different image from…
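The influence measurement described in the abstract can be sketched roughly as follows: for each sentence prefix, compare the model's next-word distribution given the correct image with the distribution given an unrelated image. This sketch assumes Jensen-Shannon divergence as the distance, which is one reasonable choice that the truncated abstract does not confirm. `image_influence` and `toy_model` are hypothetical names, and `toy_model` is a stand-in whose image effect halves with every word by construction, so its output says nothing about the real architectures.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD in bits between two probability vectors that each sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def image_influence(model, correct_image, other_image, caption):
    """Divergence at each position between outputs under two different images.

    `model(image, prefix)` is assumed to return a next-word distribution.
    """
    return [
        jensen_shannon_divergence(
            model(correct_image, caption[:t]),
            model(other_image, caption[:t]),
        )
        for t in range(len(caption))
    ]

def toy_model(image, prefix):
    # Hypothetical stand-in: image logits are scaled down as the prefix grows.
    logits = image * 0.5 ** len(prefix)
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=10), rng.normal(size=10)
scores = image_influence(toy_model, img_a, img_b, caption=[3, 1, 4, 1, 5])
print([round(s, 4) for s in scores])
```

A score near 0 at a given position means the two images lead to almost the same prediction there, so the image has little influence at that point; with base-2 logarithms the score is at most 1.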
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
