CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang

TL;DR
CausalStep is a benchmark for evaluating explicit stepwise causal reasoning in videos. It poses causally linked questions and reports diagnostic metrics, revealing gaps in the reasoning capabilities of current models.
Contribution
We introduce CausalStep, a benchmark that combines causally linked video segments, a stepwise QA protocol, taxonomy-based distractors, and diagnostic metrics to rigorously assess causal reasoning in video understanding.
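The stepwise QA protocol can be pictured with a minimal sketch. This is not the authors' evaluation code: the `Step` fields, the `run_chain` function, the error-type labels, and the rule that a chain halts at the first wrong answer are all assumptions made for illustration. The sketch shows only the two ideas named above: the model sees segments up to the current step and nothing later, and each distractor carries a label so that a wrong choice can be diagnosed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One causally linked unit of a video (all field names are hypothetical)."""
    segment_id: str
    question: str
    options: List[str]
    answer: int                       # index of the correct option
    distractor_types: Dict[int, str]  # option index -> illustrative error-type label


# A model maps (visible segment ids, question, options, prior Q/A history) to an option index.
Model = Callable[[List[str], str, List[str], List[Tuple[str, str]]], int]


def run_chain(steps: List[Step], model: Model) -> dict:
    """Ask the questions strictly in order, revealing one segment at a time."""
    history: List[Tuple[str, str]] = []
    results: List[bool] = []
    errors: Counter = Counter()
    for i, step in enumerate(steps):
        # Only segments up to the current step are visible, so later
        # context cannot be used as a shortcut.
        visible = [s.segment_id for s in steps[: i + 1]]
        choice = model(visible, step.question, step.options, list(history))
        correct = choice == step.answer
        results.append(correct)
        if not correct:
            # Record which kind of distractor was chosen.
            errors[step.distractor_types.get(choice, "unknown")] += 1
            break  # assumption: the chain halts at the first wrong answer
        history.append((step.question, step.options[choice]))
    return {
        "answered": len(results),
        "correct": sum(results),
        "completed": len(results) == len(steps) and all(results),
        "errors": dict(errors),
    }
```

Under these assumptions, a model that answers the first question correctly and then picks a distractor on the second would yield `answered == 2`, `correct == 1`, and one entry in `errors` under that distractor's label. The returned counts are illustrative and are not the benchmark's seven diagnostic metrics.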
Findings
Current models lag behind human reasoning on CausalStep
CausalStep reveals limitations of existing video reasoning models
Benchmark enables detailed diagnosis of causal reasoning skills
Abstract
Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on an error-type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning
