# Human Attention in Image Captioning: Dataset and Analysis

**Authors:** Sen He, Hamed R. Tavakoli, Ali Borji, Nicolas Pugeault

arXiv: 1903.02499 · 2019-08-08

## TL;DR

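This paper introduces a dataset of eye movements and verbal descriptions recorded synchronously over images, analyzes how human attention differs between free-viewing and captioning and how it compares with soft attention in captioning models, and shows that integrating image saliency with soft attention improves captioning on Flickr30k and MSCOCO.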

## Contribution

The paper provides a novel dataset of synchronously recorded eye movements and verbal descriptions, and uses it to analyze human attention patterns and their relationship to machine (soft) attention in image captioning.

## Key findings

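- Humans fixate on a greater variety of regions during image description than during free-viewing.
- 97% of described objects are attended to by humans.
- A convolutional neural network used as feature encoder accounts for around 78% of human-attended regions during captioning.
- Soft attention differs from human attention both spatially and temporally, and caption scores correlate only weakly with attention consistency scores.
- Integrating image saliency with the soft attention model significantly improves captioning performance on Flickr30k and MSCOCO.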
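
To make the "described objects are attended" finding concrete, the following is a minimal sketch of how such a fraction could be computed. It assumes objects are given as axis-aligned bounding boxes and that an object counts as attended when at least one fixation lands inside its box. The criterion, the function names (`is_attended`, `attended_fraction`), and the toy data are illustrative assumptions, not the paper's actual protocol.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format
Point = Tuple[float, float]              # fixation location (x, y)


def is_attended(box: Box, fixations: List[Point]) -> bool:
    """Return True if any fixation falls inside the box (assumed criterion)."""
    x0, y0, x1, y1 = box
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in fixations)


def attended_fraction(described: Dict[str, Box], fixations: List[Point]) -> float:
    """Fraction of described objects that received at least one fixation."""
    if not described:
        return 0.0
    hits = sum(is_attended(box, fixations) for box in described.values())
    return hits / len(described)


# Toy example: two described objects, two fixations (made-up values).
described = {"dog": (10, 10, 50, 60), "ball": (70, 40, 90, 60)}
fixations = [(30.0, 35.0), (5.0, 5.0)]
print(attended_fraction(described, fixations))  # 0.5
```

Averaging this fraction over all images and captions would give a dataset-level figure comparable in spirit to the 97% reported above, under the stated assumptions.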

## Abstract

In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs between free-viewing and image description tasks: humans tend to fixate on a greater variety of regions under the latter task, (2) there is a strong relationship between described objects and attended objects ($97\%$ of the described objects are attended), (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around $78\%$), (4) the soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores, which indicates a large gap between humans and machines with regard to top-down attention, and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.
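
The abstract does not say how saliency is combined with soft attention, so the sketch below shows only one plausible scheme: soft attention weights over image regions are multiplied by a per-region saliency value and renormalized. The multiplicative fusion, the function name `saliency_weighted_attention`, and the `eps` parameter are assumptions made for illustration and should not be read as the authors' method.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_weighted_attention(
    scores: List[float], saliency: List[float], eps: float = 1e-8
) -> List[float]:
    """Reweight soft attention by saliency and renormalize (assumed fusion rule).

    scores:   unnormalized attention scores, one per image region
    saliency: non-negative saliency value for the same regions
    eps:      keeps the result well defined if all saliency values are zero
    """
    if len(scores) != len(saliency):
        raise ValueError("scores and saliency must have the same length")
    alpha = softmax(scores)
    fused = [a * (s + eps) for a, s in zip(alpha, saliency)]
    total = sum(fused)
    return [f / total for f in fused]


# With uniform attention scores, the fused weights follow the saliency values.
weights = saliency_weighted_attention([0.0, 0.0, 0.0], [0.1, 0.3, 0.6])
print(weights)
```

In a captioning decoder, the fused weights would replace the plain softmax weights when forming the context vector at each decoding step; whether the paper applies saliency at this point is not stated in this summary.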

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.02499/full.md

## Figures

39 figures with captions in the complete paper: https://tomesphere.com/paper/1903.02499/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/1903.02499/full.md

---
Source: https://tomesphere.com/paper/1903.02499