VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
Pritam Sarkar, Ali Etemad

TL;DR
This paper introduces VCRBench, a new benchmark for evaluating long-form causal reasoning in large video language models, highlighting their current limitations and proposing a modular recognition-reasoning approach to improve performance.
Contribution
The paper presents VCRBench, a novel benchmark for video causal reasoning, and proposes Recognition-Reasoning Decomposition (RRD), a modular method that significantly enhances LVLMs' reasoning capabilities.
Findings
LVLMs struggle with long-range causal dependencies in videos.
RRD improves accuracy on VCRBench by up to 25.2%.
LVLMs mainly rely on language knowledge rather than visual reasoning.
Abstract
Despite recent advances in video understanding, the capabilities of Large Video Language Models (LVLMs) to perform video-based causal reasoning remain underexplored, largely due to the absence of relevant and dedicated benchmarks for evaluating causal reasoning in visually grounded and goal-driven settings. To fill this gap, we introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). We create VCRBench using procedural videos of simple everyday activities, where the steps are deliberately shuffled with each clip capturing a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting linguistic shortcuts, as seen in multiple-choice or binary QA formats, while also avoiding the challenges associated with…
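The shuffled-step setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `shuffle_steps` and `is_correct`, the example step descriptions, and the exact-match scoring rule. The benchmark's actual data format and metric may differ.

```python
import random


def shuffle_steps(steps, seed=0):
    """Shuffle an ordered list of steps (hypothetical helper).

    Returns (shuffled, correct_order), where correct_order[k] is the
    position in `shuffled` of the k-th step of the original procedure.
    """
    rng = random.Random(seed)
    positions = list(range(len(steps)))
    rng.shuffle(positions)
    shuffled = [steps[i] for i in positions]
    correct_order = [positions.index(k) for k in range(len(steps))]
    return shuffled, correct_order


def is_correct(predicted_order, correct_order):
    """Exact-match check: the whole sequence must be right (assumed metric)."""
    return list(predicted_order) == list(correct_order)


# Hypothetical procedure; each string stands in for one video clip.
steps = ["crack eggs", "whisk eggs", "pour into pan", "fold omelette"]
shuffled, correct_order = shuffle_steps(steps, seed=1)

# Reading the shuffled clips in the correct order restores the procedure.
restored = [shuffled[j] for j in correct_order]
```

In this sketch a model would see only `shuffled` plus the goal, and its answer would be an ordering of clip indices compared against `correct_order`. Requiring a full ordering rather than a choice among options is one way to avoid the multiple-choice shortcuts the abstract mentions.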
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
