CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Kexin Yi; Chuang Gan; Yunzhu Li; Pushmeet Kohli; Jiajun Wu; Antonio Torralba; Joshua B. Tenenbaum

arXiv:1910.01442·cs.CV·March 10, 2020·70 cites

TL;DR

CLEVRER introduces a diagnostic video dataset designed to evaluate models on causal reasoning tasks involving object collision events, highlighting the gap between perception and causal understanding in current models.

Contribution

The paper presents CLEVRER, a new dataset for systematic evaluation of causal reasoning in videos, and analyzes the performance gap of state-of-the-art models on causal tasks.

Findings

1. Models excel at perception-based tasks but struggle with causal reasoning.
2. Explicit symbolic models improve causal reasoning performance.
3. Current models lack integrated perception and causal understanding capabilities.

Abstract

The ability to reason about temporal and causal events from videos lies at the core of human intelligence. Most video reasoning benchmarks, however, focus on pattern recognition from complex visual and language input, instead of on causal structure. We study the complementary problem, exploring the temporal and causal structures behind videos of objects with simple visual appearance. To this end, we introduce the CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (e.g., "what color"), explanatory ("what is responsible for"), predictive ("what will happen next"), and counterfactual ("what if"). We evaluate various state-of-the-art models for visual reasoning…
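To make the diagnostic idea concrete, here is a minimal Python sketch that encodes the four question types named in the abstract and reports accuracy separately for each type rather than as one pooled score. The `QuestionType` enum, the `accuracy_by_type` helper, and the toy records are hypothetical illustrations, not part of the CLEVRER release or its official evaluation code; the real dataset's file format and metrics may differ.

```python
from collections import defaultdict
from enum import Enum


class QuestionType(Enum):
    """The four CLEVRER question types named in the abstract (hypothetical encoding)."""
    DESCRIPTIVE = "descriptive"        # e.g. "what color"
    EXPLANATORY = "explanatory"        # e.g. "what is responsible for"
    PREDICTIVE = "predictive"          # e.g. "what will happen next"
    COUNTERFACTUAL = "counterfactual"  # e.g. "what if"


def accuracy_by_type(records):
    """Return {QuestionType: accuracy} from (QuestionType, is_correct) pairs.

    Types that never appear in `records` are omitted instead of reported as 0.
    Breaking accuracy down by type is what lets a diagnostic benchmark show a
    model doing well on descriptive questions while failing causal ones.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, is_correct in records:
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Hypothetical toy predictions, invented for illustration only (not results from the paper).
toy_records = [
    (QuestionType.DESCRIPTIVE, True),
    (QuestionType.DESCRIPTIVE, True),
    (QuestionType.EXPLANATORY, True),
    (QuestionType.EXPLANATORY, False),
    (QuestionType.COUNTERFACTUAL, False),
]

for qtype, acc in accuracy_by_type(toy_records).items():
    print(f"{qtype.value:>15}: {acc:.2f}")
```

A per-type breakdown like this is one simple way to surface the perception versus causal-reasoning gap the paper reports; a single pooled accuracy would hide it.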

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Human Pose and Action Recognition