Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Zhiyuan Fang; Tejas Gokhale; Pratyay Banerjee; Chitta Baral; Yezhou Yang

arXiv:2003.05162·cs.CV·January 10, 2023·5 cites

TL;DR

This paper introduces a novel approach for generating commonsense descriptions in video captioning, emphasizing latent social aspects like intentions and effects, supported by a new dataset and question answering methods.

Contribution

The paper presents the first work on generating commonsense captions directly from videos and introduces the Video-to-Commonsense (V2C) dataset, annotated with social aspects such as intentions, effects, and agent attributes.

Findings

01. Created the V2C dataset with 9k videos and annotations

02. Demonstrated that commonsense captioning enriches video understanding

03. Explored open-ended video-based commonsense question answering

Abstract

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling