iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

Aman Chadha; Gurneet Arora; Navpreet Kaloty

arXiv:2011.07735 · cs.CV · November 17, 2020 · 24 cites

Open Access

TL;DR

iPerceive introduces a framework that incorporates common-sense reasoning and multiple modalities to improve dense video captioning and question answering by understanding causal relationships between events.

Contribution

The paper presents an approach that integrates common-sense knowledge with multi-modal data to improve dense video captioning and video question answering.

Findings

01

Outperforms previous methods on the ActivityNet Captions dataset

02

Achieves state-of-the-art results on the TVQA dataset

03

Demonstrates the importance of causal reasoning and multi-modal integration

Abstract

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which, in some cases, fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention. Part of what defines us as human and fundamentally different from machines is our instinct to seek causality behind any association, say an event Y that happened as a direct result of event X. To this end, we propose iPerceive, a framework capable of understanding the "why" between events in a video by building a common-sense knowledge base using contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique using the dense video captioning (DVC) and video question answering (VideoQA) tasks. Furthermore, while most prior work in DVC and VideoQA…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization