Less Is More: Picking Informative Frames for Video Captioning

Yangyu Chen; Shuhui Wang; Weigang Zhang; Qingming Huang

arXiv:1803.01457 · cs.CV · March 6, 2018 · 26 citations

TL;DR

This paper introduces PickNet, a reinforcement-learning-based method for selecting informative frames in video captioning, reducing redundancy and computation while maintaining high performance.

Contribution

It proposes a novel plug-and-play frame picking module that enhances video captioning by selecting diverse and relevant frames using reinforcement learning.

Findings

1. Uses 6-8 frames to achieve competitive results
2. Reduces computational cost without performance loss
3. Improves robustness by selecting informative frames

Abstract

In the video captioning task, the best results have been achieved by attention-based models that associate salient visual components with sentences in the video. However, existing studies follow a common procedure that applies frame-level appearance modeling and motion modeling to frames sampled at equal intervals, which may introduce redundant visual information, sensitivity to content noise, and unnecessary computation cost. We propose a plug-and-play PickNet to perform informative frame picking in video captioning. Based on a standard Encoder-Decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame picking action is designed by maximizing visual diversity and minimizing textual discrepancy. If the candidate is rewarded, it will be selected and the corresponding latent representation of Encoder-Decoder will…
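The reward idea in the abstract (favor visually diverse picks, penalize textual discrepancy) can be illustrated with a small sketch. This is not the paper's formulation: the function names, the use of mean pairwise Euclidean distance as the diversity measure, the scalar `text_discrepancy` input, and the weights `w_visual` and `w_text` are all assumptions made for illustration.

```python
import numpy as np


def visual_diversity(picked):
    """Mean pairwise Euclidean distance between picked frame features.

    Assumption: diversity is measured this way purely for illustration.
    Returns 0.0 when fewer than two frames have been picked.
    """
    n = len(picked)
    if n < 2:
        return 0.0
    feats = np.stack(picked)                       # shape (n, d)
    diffs = feats[:, None, :] - feats[None, :, :]  # shape (n, n, d)
    dists = np.linalg.norm(diffs, axis=-1)         # shape (n, n), zero diagonal
    # Sum over ordered pairs, divided by the number of ordered pairs.
    return float(dists.sum() / (n * (n - 1)))


def picking_reward(picked, text_discrepancy, w_visual=1.0, w_text=1.0):
    """Hypothetical reward: higher visual diversity is better,
    higher textual discrepancy (supplied by the caller) is worse."""
    return w_visual * visual_diversity(picked) - w_text * text_discrepancy


# Toy feature vectors standing in for frame representations.
frame_a = np.array([0.0, 0.0])
frame_b = np.array([3.0, 4.0])

diverse = picking_reward([frame_a, frame_b], text_discrepancy=1.0)
redundant = picking_reward([frame_a, frame_a], text_discrepancy=1.0)
```

With the same textual discrepancy, the pick containing two distinct frames scores higher than the pick containing a duplicated frame, which is the behavior a diversity term is meant to encourage.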

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques