CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum

TL;DR
CLEVRER introduces a diagnostic video dataset designed to evaluate models on causal reasoning tasks involving object collision events, highlighting the gap between perception and causal understanding in current models.
Contribution
The paper presents CLEVRER, a new dataset for systematic evaluation of causal reasoning in videos, and analyzes the performance gap of state-of-the-art models on causal tasks.
Findings
Models excel at perception-based tasks but struggle with causal reasoning.
Explicit symbolic models improve causal reasoning performance.
Current models lack integrated perception and causal understanding capabilities.
Abstract
The ability to reason about temporal and causal events from videos lies at the core of human intelligence. Most video reasoning benchmarks, however, focus on pattern recognition from complex visual and language input, instead of on causal structure. We study the complementary problem, exploring the temporal and causal structures behind videos of objects with simple visual appearance. To this end, we introduce the CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (e.g., "what color"), explanatory ("what is responsible for"), predictive ("what will happen next"), and counterfactual ("what if"). We evaluate various state-of-the-art models for visual reasoning…
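The four question types named in the abstract suggest reporting accuracy per type rather than as a single number, since that is how a gap between perception and causal reasoning becomes visible. The following is a minimal sketch of such a breakdown. The record format `(question type, predicted answer, gold answer)`, the function name `accuracy_by_type`, and the toy records are all assumptions made for illustration; they are not the dataset's actual file format or evaluation protocol.

```python
from collections import defaultdict
from enum import Enum


class QuestionType(Enum):
    """The four question types listed in the abstract."""
    DESCRIPTIVE = "descriptive"
    EXPLANATORY = "explanatory"
    PREDICTIVE = "predictive"
    COUNTERFACTUAL = "counterfactual"


def accuracy_by_type(records):
    """Return {QuestionType: accuracy} for the types present in `records`.

    `records` is an iterable of (QuestionType, predicted, gold) tuples.
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        if predicted == gold:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy records invented for this example; they are not results from the paper.
toy_records = [
    (QuestionType.DESCRIPTIVE, "red", "red"),
    (QuestionType.DESCRIPTIVE, "cube", "cube"),
    (QuestionType.COUNTERFACTUAL, "yes", "no"),
    (QuestionType.COUNTERFACTUAL, "no", "no"),
]
scores = accuracy_by_type(toy_records)
for qtype, acc in scores.items():
    print(f"{qtype.value}: {acc:.2f}")
```

Keeping the scores separate per type means a model that answers descriptive questions well cannot hide weak counterfactual performance behind a high overall average.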
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Human Pose and Action Recognition
