Actor-Critic Sequence Training for Image Captioning
Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, Timothy M. Hospedales

TL;DR
This paper introduces an actor-critic reinforcement learning approach for image captioning that directly optimizes language quality metrics, achieving state-of-the-art results on the MSCOCO benchmark.
Contribution
The paper proposes an actor-critic training method for image captioning that directly maximizes non-differentiable quality metrics such as CIDEr, improving over traditional likelihood-based training.
Findings
Achieves state-of-the-art performance on MSCOCO
Directly optimizes CIDEr and other metrics
Outperforms likelihood-based training methods
Abstract
Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of the ground-truth annotated caption given the image. While simple and easy to implement, this approach does not directly maximise the language quality metrics we care about, such as CIDEr. In this paper we investigate training image captioning methods based on actor-critic reinforcement learning in order to directly optimise non-differentiable quality metrics of interest. By formulating a per-token advantage and value computation strategy in this novel reinforcement learning based captioning model, we show that it is possible to achieve state-of-the-art performance on the widely used MSCOCO benchmark.
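To make the idea of a per-token advantage concrete, here is a minimal Python sketch. It is not the authors' formulation, since the abstract does not give their formulas. It assumes a generic one-step temporal-difference advantage, a single reward at the end of the caption (for example a CIDEr score), and a hypothetical critic whose value estimates are passed in as a plain list. The names `per_token_advantages` and `actor_loss` are illustrative only.

```python
from typing import List


def per_token_advantages(
    values: List[float], final_reward: float, gamma: float = 1.0
) -> List[float]:
    """One-step TD advantages for a caption scored only at its end.

    values[t] is an assumed critic estimate of the expected final score
    just before token t is emitted. The reward is zero for every token
    except the last, which receives final_reward.
    """
    n = len(values)
    advantages = []
    for t in range(n):
        is_last = t == n - 1
        reward = final_reward if is_last else 0.0
        # No state follows the last token, so its bootstrap value is 0.
        next_value = 0.0 if is_last else values[t + 1]
        advantages.append(reward + gamma * next_value - values[t])
    return advantages


def actor_loss(log_probs: List[float], advantages: List[float]) -> float:
    """Policy-gradient surrogate: negative advantage-weighted log-likelihood."""
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages))


# Hypothetical numbers: three tokens, critic values rising toward the score.
advs = per_token_advantages([0.5, 0.6, 0.8], final_reward=1.0)
loss = actor_loss([-1.0, -2.0, -0.5], advs)
```

With `gamma = 1.0` the advantages telescope, so their sum equals the final reward minus the critic's initial estimate. Tokens that raise the expected score get positive weight in the actor loss, while likelihood training would weight every ground-truth token equally.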
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
