Visual Commonsense-aware Representation Network for Video Captioning
Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen

TL;DR
This paper introduces VCRN, a video captioning model that leverages visual commonsense knowledge through a dataset-derived video dictionary and concept-based enhancements, achieving state-of-the-art results on MSVD, MSR-VTT, and VATEX.
Contribution
The paper proposes a new method, VCRN, which incorporates a visual commonsense-aware representation using a dataset-derived video dictionary and concept integration for improved captioning.
Findings
Achieves state-of-the-art performance on MSVD, MSR-VTT, and VATEX benchmarks.
Enhances video captioning by integrating visual commonsense knowledge.
Improves video question answering performance when combined with existing methods.
Abstract
Generating consecutive descriptions for videos, i.e., video captioning, requires taking full advantage of the visual representation along with the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods exploit only the superficial associations contained in the video itself, without considering the intrinsic visual commonsense knowledge that exists in a video dataset, which may hinder their ability to reason toward accurate descriptions. To address this problem, we propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning. Specifically, we construct a Video Dictionary, a plug-and-play component, obtained by clustering all video features from the total dataset into multiple…
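The dictionary-building step mentioned in the abstract can be illustrated with a minimal sketch. The abstract shown here is truncated and does not name the clustering algorithm, so plain k-means is an assumption, as are all names below (`build_video_dictionary`, `nearest_entry`, `num_entries`). This is not the authors' implementation; it only shows the general idea of collapsing dataset-wide video features into a small set of cluster centres that can later be looked up for any video.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_entry(dictionary, feature):
    """Index of the dictionary entry closest to `feature`."""
    return min(range(len(dictionary)),
               key=lambda k: squared_distance(dictionary[k], feature))


def build_video_dictionary(features, num_entries, num_iters=20, seed=0):
    """Cluster feature vectors into `num_entries` centres.

    Assumption: plain k-means stands in for whatever clustering the
    paper actually uses. `features` is a list of equal-length lists.
    """
    rng = random.Random(seed)
    # Start from distinct, randomly chosen feature vectors.
    dictionary = [list(f) for f in rng.sample(features, num_entries)]
    for _ in range(num_iters):
        # Assignment step: group every feature under its nearest centre.
        groups = [[] for _ in range(num_entries)]
        for f in features:
            groups[nearest_entry(dictionary, f)].append(f)
        # Update step: move each centre to the mean of its group.
        for k, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centre
                dictionary[k] = [sum(col) / len(members)
                                 for col in zip(*members)]
    return dictionary
```

In this sketch, a feature vector for a new video would be passed to `nearest_entry` to find the dictionary entry it most resembles. How VCRN actually consumes the dictionary is not described in the text available here.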
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
