On Architectures for Including Visual Information in Neural Language Models for Image Description

Marc Tanti; Albert Gatt; Kenneth P. Camilleri

arXiv:1911.03738 · cs.NE · November 12, 2019 · 1 citation

TL;DR

This paper compares four neural architectures for integrating visual information into language models for image captioning. It finds that init-inject performs best in the four-way comparison, while merge retains visual influence longer and benefits from transfer learning.

Contribution

The paper identifies four main architectures for integrating visual information into neural language models (init-inject, pre-inject, par-inject, and merge), determines which performs best, measures how much the image influences each one, and examines the benefit of transfer learning.

Findings

01. The init-inject architecture performs best of the four, on both automatic evaluation measures and human annotation.

02. The merge architecture retains visual influence for longer.

03. Transfer learning improves the performance of the merge architecture.

Abstract

A neural language model can be conditioned into generating descriptions for images by providing visual information apart from the sentence prefix. This visual information can be included into the language model through different points of entry resulting in different neural architectures. We identify four main architectures which we call init-inject, pre-inject, par-inject, and merge. We analyse these four architectures and conclude that the best performing one is init-inject, which is when the visual information is injected into the initial state of the recurrent neural network. We confirm this using both automatic evaluation measures and human annotation. We then analyse how much influence the images have on each architecture. This is done by measuring how different the output probabilities of a model are when a partial sentence is combined with a completely different image from…
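The two architectures singled out above can be illustrated with a small sketch. This is not the authors' implementation (their repository is listed under Code & Models); it is a toy with untrained random weights, written in plain Python. The abstract defines init-inject as injecting the visual information into the initial state of the recurrent neural network. The merge variant below follows our reading of the authors' related work, where the image is kept out of the RNN and combined with its hidden state just before the output layer; combining by concatenation is an assumption. The total variation distance used in `image_influence` is also an assumption, chosen only to illustrate the idea of comparing output probabilities for the same sentence prefix paired with two different images. All names, sizes, and values are hypothetical.

```python
import math
import random

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes; weights are random and untrained.
VOCAB, EMBED, HIDDEN, IMAGE = 6, 4, 5, 3
rng = random.Random(0)
E = rand_matrix(VOCAB, EMBED, rng)                # word embeddings
W_x = rand_matrix(HIDDEN, EMBED, rng)             # input-to-hidden weights
W_h = rand_matrix(HIDDEN, HIDDEN, rng)            # hidden-to-hidden weights
W_img = rand_matrix(HIDDEN, IMAGE, rng)           # image feature projection
W_out = rand_matrix(VOCAB, HIDDEN, rng)           # output layer for init-inject
W_out_merge = rand_matrix(VOCAB, 2 * HIDDEN, rng) # output layer for merge

def run_rnn(prefix, h):
    # Simple (Elman-style) RNN over the token ids in the sentence prefix.
    for tok in prefix:
        h = tanh_vec(add(matvec(W_x, E[tok]), matvec(W_h, h)))
    return h

def init_inject(image, prefix):
    # The image sets the initial RNN state; the RNN then reads the prefix.
    h0 = tanh_vec(matvec(W_img, image))
    return softmax(matvec(W_out, run_rnn(prefix, h0)))

def merge(image, prefix):
    # The RNN sees only words; the image joins the final state before output.
    h = run_rnn(prefix, [0.0] * HIDDEN)
    img = tanh_vec(matvec(W_img, image))
    return softmax(matvec(W_out_merge, h + img))  # list concatenation

def image_influence(model, image_a, image_b, prefix):
    # Total variation distance between next-word distributions when the
    # same prefix is paired with two different images (assumed measure).
    p, q = model(image_a, prefix), model(image_b, prefix)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

image_a = [1.0, 0.0, 0.5]
image_b = [-1.0, 0.8, 0.0]
prefix = [0, 2, 3]
for n in range(len(prefix) + 1):
    print(n,
          round(image_influence(init_inject, image_a, image_b, prefix[:n]), 4),
          round(image_influence(merge, image_a, image_b, prefix[:n]), 4))
```

Because the weights are random, the printed numbers say nothing about the paper's results. The sketch only shows where the image enters each architecture and how a swapped-image comparison could be set up for prefixes of increasing length.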

Code & Models

Repositories

mtanti/mtanti-phd (tf, Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques