Visual Commonsense-aware Representation Network for Video Captioning

Pengpeng Zeng; Haonan Zhang; Lianli Gao; Xiangpeng Li; Jin Qian; Heng Tao Shen

arXiv:2211.09469 · cs.CV · November 18, 2022

TL;DR

This paper introduces VCRN, a video captioning model that draws on visual commonsense knowledge through a dataset-derived video dictionary and concept-based enhancements, reporting state-of-the-art results on MSVD, MSR-VTT, and VATEX.

Contribution

The paper proposes a new method, VCRN, which incorporates a visual commonsense-aware representation using a dataset-derived video dictionary and concept integration for improved captioning.

Findings

01

Achieves state-of-the-art performance on MSVD, MSR-VTT, and VATEX benchmarks.

02

Enhances video captioning by integrating visual commonsense knowledge.

03

Improves video question answering performance when combined with existing methods.

Abstract

Generating consecutive descriptions for videos, i.e., Video Captioning, requires taking full advantage of visual representation along with the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods only exploit the superficial associations contained in the video itself, without considering the intrinsic visual commonsense knowledge that exists in a video dataset, which may limit their ability to reason toward accurate descriptions. To address this problem, we propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning. Specifically, we construct a Video Dictionary, a plug-and-play component, obtained by clustering all video features from the total dataset into multiple…
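
The dictionary construction mentioned in the abstract can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the abstract only says that video features are clustered, so plain k-means is an assumption here, and the function name `build_video_dictionary`, its parameters, and the random toy features are all hypothetical.

```python
import numpy as np


def build_video_dictionary(features, num_entries, num_iters=20, seed=0):
    """Cluster feature vectors into `num_entries` centroids.

    Assumption: plain k-means with Euclidean distance. The abstract only
    states that features are clustered, not which algorithm is used.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen feature vectors.
    init_idx = rng.choice(len(features), size=num_entries, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every feature to every centroid.
        diffs = features[:, None, :] - centroids[None, :, :]
        dists = (diffs ** 2).sum(axis=-1)
        assignment = dists.argmin(axis=1)
        for k in range(num_entries):
            members = features[assignment == k]
            # Keep the previous centroid if a cluster ends up empty.
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids


# Toy stand-in for dataset-wide video features (hypothetical shapes):
# 200 feature vectors of dimension 16.
toy_features = np.random.default_rng(1).normal(size=(200, 16))
video_dictionary = build_video_dictionary(toy_features, num_entries=8)
```

Each row of `video_dictionary` is one cluster centroid. Because the result depends only on the dataset features, it can be computed once offline and reused, which is consistent with the abstract's description of the dictionary as a plug-and-play component.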

Code & Models

Repositories

zchoi/vcrn · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques