Consensus-based Sequence Training for Video Captioning

Sang Phan; Gustav Eje Henter; Yusuke Miyao; Shin'ichi Satoh

arXiv:1712.09532 · cs.CV · December 29, 2017 · 19 cites

Open Access

TL;DR

This paper introduces Consensus-based Sequence Training (CST), a fast reinforcement learning approach for video captioning that leverages ground-truth caption consensus to optimize evaluation metrics directly, achieving state-of-the-art results.

Contribution

The paper proposes a novel, efficient reinforcement learning method for video captioning that uses ground-truth caption consensus as a baseline, significantly improving training speed and performance.

Findings

01

Training speed is significantly improved compared to previous RL methods.

02

The method achieves a new state-of-the-art CIDEr score of 54.2 on MSRVTT.

03

The method effectively optimizes captioning metrics directly.

Abstract

Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each step to make training converge. We propose a fast approach to optimize one's objective of interest through the REINFORCE algorithm. First we show that, by replacing model samples with ground-truth sentences, RL training can be seen as a form of weighted cross-entropy loss, giving a fast, RL-based pre-training algorithm. Second, we propose to use the consensus among ground-truth captions of the same video as the baseline reward. This can be computed very efficiently. We call the complete proposal…
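The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the metric below is a toy unigram-F1 overlap standing in for a real captioning metric such as CIDEr, the function names (`overlap_score`, `consensus_scores`, `weighted_xent_loss`) are hypothetical, the log-likelihoods are made-up numbers, and taking the baseline to be the mean consensus score is an assumption made for illustration.

```python
from collections import Counter


def overlap_score(candidate, references):
    """Toy stand-in for a captioning metric: mean unigram F1 against references."""
    cand = Counter(candidate.split())
    scores = []
    for ref in references:
        ref_counts = Counter(ref.split())
        common = sum((cand & ref_counts).values())  # clipped token matches
        if common == 0:
            scores.append(0.0)
            continue
        precision = common / sum(cand.values())
        recall = common / sum(ref_counts.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


def consensus_scores(captions):
    """Score each ground-truth caption against the other captions of the same video."""
    return [
        overlap_score(cap, captions[:i] + captions[i + 1:])
        for i, cap in enumerate(captions)
    ]


def weighted_xent_loss(log_probs, rewards, baseline=0.0):
    """REINFORCE-style loss with ground-truth captions in place of model samples.

    Each caption's negative log-likelihood is weighted by (reward - baseline),
    so the loss is a weighted cross-entropy over the ground-truth captions.
    """
    weighted = [(r - baseline) * lp for r, lp in zip(rewards, log_probs)]
    return -sum(weighted) / len(weighted)


# Three ground-truth captions for one video; the last one is an outlier.
captions = [
    "a man is playing a guitar",
    "a man plays guitar",
    "someone is cooking food",
]
rewards = consensus_scores(captions)
baseline = sum(rewards) / len(rewards)  # assumed: mean consensus as the baseline

# Hypothetical sentence-level log-likelihoods under the captioning model.
log_probs = [-4.0, -3.5, -6.0]
loss = weighted_xent_loss(log_probs, rewards, baseline)
```

In this sketch the rewards depend only on the ground-truth captions, so they can be computed once before training rather than by scoring fresh model samples at every step. With all rewards equal to 1 and a zero baseline, `weighted_xent_loss` reduces to the ordinary mean negative log-likelihood.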

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization