Actor-Critic Sequence Training for Image Captioning

Li Zhang; Flood Sung; Feng Liu; Tao Xiang; Shaogang Gong; Yongxin Yang; Timothy M. Hospedales

arXiv:1706.09601 · cs.CV · November 29, 2017 · 99 citations

Open Access

TL;DR

This paper introduces an actor-critic reinforcement learning approach for image captioning that directly optimizes language quality metrics, achieving state-of-the-art results on the MSCOCO benchmark.

Contribution

It proposes a novel actor-critic training method for image captioning that directly maximizes non-differentiable quality metrics, improving over traditional likelihood-based training.

Findings

01 Achieves state-of-the-art performance on MSCOCO

02 Directly optimizes CIDEr and other metrics

03 Outperforms likelihood-based training methods

Abstract

Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of the ground-truth annotated caption given the image. While simple and easy to implement, this approach does not directly maximise the language quality metrics we care about, such as CIDEr. In this paper we investigate training image captioning methods based on actor-critic reinforcement learning in order to directly optimise non-differentiable quality metrics of interest. By formulating a per-token advantage and value computation strategy in this novel reinforcement learning based captioning model, we show that it is possible to achieve state-of-the-art performance on the widely used MSCOCO benchmark.
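To make the idea of a per-token advantage concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this page does not spell out. It assumes a generic actor-critic setup in which the only reward is a sequence-level score (for example CIDEr) received after the final token, the critic supplies one value estimate per token, and the advantage is the Monte Carlo return minus that estimate. The function names `per_token_advantages` and `actor_critic_losses` and the discount `gamma` are hypothetical, chosen for illustration.

```python
from typing import List, Tuple


def per_token_advantages(
    final_reward: float,
    values: List[float],
    gamma: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Return (returns, advantages) for one sampled caption.

    Assumption: the reward is a single sequence-level score given after
    the last token; every intermediate reward is zero. values[t] is the
    critic's estimate of the expected return at token position t.
    """
    num_tokens = len(values)
    # With zero intermediate rewards, the discounted return at step t is
    # gamma^(T - 1 - t) * final_reward.
    returns = [
        final_reward * gamma ** (num_tokens - 1 - t) for t in range(num_tokens)
    ]
    # Advantage: how much better the outcome was than the critic expected.
    advantages = [r - v for r, v in zip(returns, values)]
    return returns, advantages


def actor_critic_losses(
    log_probs: List[float],
    values: List[float],
    final_reward: float,
    gamma: float = 1.0,
) -> Tuple[float, float]:
    """Return (actor_loss, critic_loss) for one sampled caption.

    log_probs[t] is the log-probability the actor assigned to the token
    it sampled at step t.
    """
    returns, advantages = per_token_advantages(final_reward, values, gamma)
    # Actor: policy-gradient surrogate, advantages treated as constants.
    actor_loss = -sum(a * lp for a, lp in zip(advantages, log_probs))
    # Critic: mean squared error between value estimates and returns.
    critic_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)
    return actor_loss, critic_loss


if __name__ == "__main__":
    # Toy three-token caption with a made-up metric score of 1.0.
    print(actor_critic_losses([-1.0, -2.0, -3.0], [0.5, 0.5, 1.0], 1.0))
```

In a real training loop the log-probabilities and values would be differentiable tensors produced by the captioning network and a critic head, and the reward would come from a metric implementation. A bootstrapped temporal-difference target could replace the Monte Carlo return used here.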

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques