Less Is More: Picking Informative Frames for Video Captioning
Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang

TL;DR
This paper introduces PickNet, a reinforcement-learning-based method for selecting informative frames in video captioning, reducing redundancy and computation while maintaining high performance.
Contribution
It proposes a novel plug-and-play frame picking module that enhances video captioning by selecting diverse and relevant frames using reinforcement learning.
Findings
Uses 6-8 frames to achieve competitive results
Reduces computational cost without performance loss
Improves robustness by selecting informative frames
Abstract
In the video captioning task, the best practice has been achieved by attention-based models which associate salient visual components with sentences in the video. However, existing studies follow a common procedure which includes frame-level appearance modeling and motion modeling on equal-interval frame sampling, which may bring about redundant visual information, sensitivity to content noise and unnecessary computational cost. We propose a plug-and-play PickNet to perform informative frame picking in video captioning. Based on a standard Encoder-Decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame picking action is designed by maximizing visual diversity and minimizing textual discrepancy. If the candidate is rewarded, it will be selected and the corresponding latent representation of Encoder-Decoder will…
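The reward described in the abstract combines two terms: visual diversity (to be maximized) and textual discrepancy (to be minimized). The following is a minimal sketch of how such a reward could be composed, not the paper's actual formulation. The function names, the mean pairwise L1 distance used for diversity, the unigram Jaccard overlap used as a stand-in for a caption metric, and the weight `lam` are all assumptions made for illustration.

```python
from typing import Sequence


def visual_diversity(frames: Sequence[Sequence[float]]) -> float:
    """Mean pairwise L1 distance between picked frame feature vectors.

    Assumption: each frame is a flat feature vector of equal length.
    Returns 0.0 when fewer than two frames are picked.
    """
    n = len(frames)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            dim = len(frames[i])
            # Average absolute difference per feature dimension.
            total += sum(abs(a - b) for a, b in zip(frames[i], frames[j])) / dim
            pairs += 1
    return total / pairs


def textual_discrepancy(generated: Sequence[str], reference: Sequence[str]) -> float:
    """One minus the unigram Jaccard overlap of two tokenized captions.

    Assumption: a simple stand-in for a real captioning metric.
    """
    g, r = set(generated), set(reference)
    if not g and not r:
        return 0.0
    return 1.0 - len(g & r) / len(g | r)


def pick_reward(
    frames: Sequence[Sequence[float]],
    generated: Sequence[str],
    reference: Sequence[str],
    lam: float = 1.0,
) -> float:
    """Hypothetical combined reward: diversity minus weighted discrepancy."""
    return visual_diversity(frames) - lam * textual_discrepancy(generated, reference)


if __name__ == "__main__":
    caption = ["a", "man", "is", "cooking"]
    redundant = [[0.0, 0.0], [0.0, 0.0]]  # two identical frames
    diverse = [[0.0, 0.0], [1.0, 1.0]]    # two distinct frames
    print(pick_reward(redundant, caption, caption))
    print(pick_reward(diverse, caption, caption))
```

With the caption held fixed, the diverse pair of frames scores higher than the redundant pair, which is the behavior a picking policy trained on such a reward would be encouraged toward.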
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
