Consensus-based Sequence Training for Video Captioning
Sang Phan, Gustav Eje Henter, Yusuke Miyao, Shin'ichi Satoh

TL;DR
This paper introduces Consensus-based Sequence Training (CST), a fast reinforcement learning approach for video captioning that leverages ground-truth caption consensus to optimize evaluation metrics directly, achieving state-of-the-art results.
Contribution
The paper proposes a novel, efficient reinforcement learning method for video captioning that uses ground-truth caption consensus as a baseline, significantly improving training speed and performance.
Findings
Training speed is significantly improved compared to previous RL methods.
CST achieves a new state-of-the-art CIDEr score of 54.2 on MSRVTT.
The method effectively optimizes captioning metrics directly.
Abstract
Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each step to make training converge. We propose a fast approach to optimize one's objective of interest through the REINFORCE algorithm. First we show that, by replacing model samples with ground-truth sentences, RL training can be seen as a form of weighted cross-entropy loss, giving a fast, RL-based pre-training algorithm. Second, we propose to use the consensus among ground-truth captions of the same video as the baseline reward. This can be computed very efficiently. We call the complete proposal…
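The two ideas in the abstract (treating ground-truth captions as the "samples" in REINFORCE, and using their consensus as the baseline) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names are hypothetical, `toy_reward` is a simple unigram-F1 stand-in for a real metric such as CIDEr, and taking the baseline as the mean reward over a video's ground-truth captions is an assumed reading of "consensus".

```python
from collections import Counter


def toy_reward(candidate, references):
    """Stand-in for a captioning metric (assumption): mean unigram F1 vs. references."""
    def f1(a, b):
        ca, cb = Counter(a.split()), Counter(b.split())
        overlap = sum((ca & cb).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(ca.values())
        recall = overlap / sum(cb.values())
        return 2 * precision * recall / (precision + recall)

    return sum(f1(candidate, ref) for ref in references) / len(references)


def consensus_weights(captions):
    """Score each ground-truth caption against the other captions of the same video,
    then subtract the mean score (the assumed consensus baseline)."""
    rewards = [
        toy_reward(cap, captions[:i] + captions[i + 1:])
        for i, cap in enumerate(captions)
    ]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def weighted_cross_entropy(log_probs, weights):
    """REINFORCE surrogate with ground-truth captions in place of model samples:
    loss = -(1/N) * sum_i weight_i * log p(caption_i | video)."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)


captions = ["a man is cooking", "a man is cooking food", "a dog runs"]
weights = consensus_weights(captions)
# Placeholder per-caption log-likelihoods that a captioning model would supply.
loss = weighted_cross_entropy([-2.0, -2.5, -1.0], weights)
```

Because the weights depend only on the ground-truth captions, they can be computed once before training, and no sampling from the model is needed. Captions that agree with the others get positive weight, and outliers get negative weight.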
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
