Reinforced Video Captioning with Entailment Rewards
Ramakanth Pasunuru, Mohit Bansal

TL;DR
This paper introduces a reinforcement learning approach with an entailment-based reward for video captioning, significantly improving sentence-level metrics and achieving state-of-the-art results on MSR-VTT.
Contribution
It proposes a novel entailment-enhanced reward (CIDEnt) for reinforcement learning in video captioning, improving logical consistency and metric performance.
Findings
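Policy-gradient and mixed-loss training gives significant improvements over the cross-entropy baseline on both automatic metrics and human evaluation, across multiple datasets.
The CIDEnt reward gives further significant improvements over the CIDEr-reward model.
The CIDEnt-reward model achieves a new state of the art on the MSR-VTT dataset.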
Abstract
Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize word-level cross-entropy loss during training. First, using policy gradient and mixed-loss methods for reinforcement learning, we directly optimize sentence-level task-based metrics (as rewards), achieving significant improvements over the baseline, based on both automatic metrics and human evaluation on multiple datasets. Next, we propose a novel entailment-enhanced reward (CIDEnt) that corrects phrase-matching based metrics (such as CIDEr) to only allow for logically-implied partial matches and avoid contradictions, achieving further significant improvements over the CIDEr-reward model. Overall, our CIDEnt-reward model achieves the new state-of-the-art on the MSR-VTT dataset.
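The reward shaping and mixed loss described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the constant-penalty form of the correction, and the values of `threshold`, `penalty`, and `gamma` are all assumptions made for the example. The CIDEr score and the entailment probability are assumed to come from an external scorer and an external entailment classifier, respectively.

```python
def cident_reward(cider_score, entailment_prob, threshold=0.33, penalty=0.45):
    """Hypothetical entailment-corrected reward.

    If the entailment classifier gives the sampled caption a low probability
    of being entailed by the reference, subtract a penalty from its CIDEr
    score; otherwise keep the CIDEr score unchanged.
    The threshold and penalty defaults are placeholders for illustration.
    """
    if entailment_prob < threshold:
        return cider_score - penalty
    return cider_score


def reinforce_loss(reward, baseline, sampled_log_prob):
    """Policy-gradient (REINFORCE) loss for one sampled caption.

    sampled_log_prob is the sum of token log-probabilities of the sampled
    caption; the baseline reduces the variance of the gradient estimate.
    """
    return -(reward - baseline) * sampled_log_prob


def mixed_loss(xe_loss, rl_loss, gamma=0.5):
    """Weighted combination of cross-entropy and reinforcement losses.

    gamma is an assumed mixing weight in [0, 1].
    """
    return (1.0 - gamma) * xe_loss + gamma * rl_loss


# Example: a caption with a good CIDEr score that the classifier
# considers unlikely to be entailed by the reference.
reward = cident_reward(cider_score=1.0, entailment_prob=0.1)
rl = reinforce_loss(reward, baseline=0.3, sampled_log_prob=-2.0)
total = mixed_loss(xe_loss=1.5, rl_loss=rl, gamma=0.5)
```

The intent of the correction is that a caption which matches many reference phrases but contradicts or is not implied by the reference receives a lower reward than plain CIDEr would give it.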
