# Multi-modal gated recurrent units for image description

**Authors:** Xuelong Li, Aihong Yuan, Xiaoqiang Lu

arXiv: 1904.09421 · 2019-04-23

## TL;DR

This paper introduces a multi-modal gated recurrent unit (GRU) model that generates descriptive sentences for images by learning the inter-modal relations between image features and sentences. The authors report state-of-the-art results on Flickr8K, Flickr30K, and MS COCO.

## Contribution

The paper proposes a novel multi-modal GRU model that integrates image features and sentence representations for improved image captioning performance.

## Key findings

- Achieves state-of-the-art results on Flickr8K, Flickr30K, and MS COCO datasets.
- Effectively models inter-modal relations between images and sentences.
- Generates relevant, grammatically correct descriptions for images.

## Abstract

Using a natural language sentence to describe the content of an image is a challenging but very important task. It is challenging because a description must not only capture the objects contained in the image and the relationships among them, but also be relevant and grammatically correct. In this paper, we propose a multi-modal embedding model based on gated recurrent units (GRU) that can generate variable-length descriptions for a given image. In the training step, we apply a convolutional neural network (CNN) to extract the image feature. The feature is then fed into the multi-modal GRU together with the corresponding sentence representations, and the multi-modal GRU learns the inter-modal relations between image and sentence. In the testing step, when an image is fed into our multi-modal GRU model, a sentence describing the image content is generated. The experimental results demonstrate that our multi-modal GRU model obtains state-of-the-art performance on the Flickr8K, Flickr30K, and MS COCO datasets.
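
To make the test-time pipeline in the abstract more concrete, here is a minimal NumPy sketch of a standard GRU cell driving greedy, variable-length caption generation. It is an illustration under stated assumptions, not the authors' architecture: this summary does not specify how the image feature enters the GRU, so the sketch assumes the projected CNN feature is fed as the first input step. The names `init_params`, `gru_step`, `greedy_caption`, and `W_img`, as well as the random weights and all dimensions, are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(embed_dim, hidden_dim, vocab_size, image_dim, seed=0):
    """Random, untrained weights; a real model would learn these from data."""
    rng = np.random.default_rng(seed)

    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    params = {
        "W_img": mat(embed_dim, image_dim),   # assumed projection of the CNN feature
        "E": mat(vocab_size, embed_dim),      # word embedding table
        "W_out": mat(vocab_size, hidden_dim), # hidden state -> vocabulary scores
    }
    for gate in ("z", "r", "h"):
        params["W_" + gate] = mat(hidden_dim, embed_dim)
        params["U_" + gate] = mat(hidden_dim, hidden_dim)
        params["b_" + gate] = np.zeros(hidden_dim)
    return params


def gru_step(p, x, h):
    """One standard GRU update: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h + p["b_z"])
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h + p["b_r"])
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h) + p["b_h"])
    return (1.0 - z) * h + z * h_cand


def greedy_caption(p, image_feature, start_id, end_id, max_len=20):
    """Generate token ids until the end token appears or max_len is reached."""
    h = np.zeros(p["U_z"].shape[0])
    # Assumption: the image conditions the GRU as its first input step.
    h = gru_step(p, p["W_img"] @ image_feature, h)
    token, caption = start_id, []
    for _ in range(max_len):
        h = gru_step(p, p["E"][token], h)
        token = int(np.argmax(p["W_out"] @ h))
        if token == end_id:
            break  # the caption length is decided by the model, not fixed
        caption.append(token)
    return caption
```

With untrained weights the output ids are meaningless; the sketch only shows how a recurrent decoder can emit descriptions of varying length from a single image feature vector. In practice `image_feature` would come from a pretrained CNN, and the ids would be mapped back to words through a vocabulary.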

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.09421/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1904.09421/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/1904.09421/full.md

---
Source: https://tomesphere.com/paper/1904.09421