VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

Pritam Sarkar; Ali Etemad

arXiv:2505.08455·cs.CV·May 14, 2025

TL;DR

This paper introduces VCRBench, a new benchmark for evaluating long-form causal reasoning in large video language models, highlighting their current limitations and proposing a modular recognition-reasoning approach to improve performance.

Contribution

The paper presents VCRBench, a novel benchmark for video causal reasoning, and proposes Recognition-Reasoning Decomposition (RRD), a modular method that splits the task into recognition and reasoning stages to improve LVLMs' reasoning performance.

Findings

1. LVLMs struggle with long-range causal dependencies in videos.
2. RRD improves accuracy on VCRBench by up to 25.2%.
3. LVLMs mainly rely on language knowledge rather than visual reasoning.

Abstract

Despite recent advances in video understanding, the capabilities of Large Video Language Models (LVLMs) to perform video-based causal reasoning remain underexplored, largely due to the absence of relevant and dedicated benchmarks for evaluating causal reasoning in visually grounded and goal-driven settings. To fill this gap, we introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). We create VCRBench using procedural videos of simple everyday activities, where the steps are deliberately shuffled with each clip capturing a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting linguistic shortcuts, as seen in multiple-choice or binary QA formats, while also avoiding the challenges associated with…
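
The task format described in the abstract (shuffled clips that must be put back into the goal-achieving order) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `shuffle_clips`, `exact_match`, and `accuracy` are hypothetical, and scoring by exact match of the full sequence is an assumption, since the excerpt above does not state the metric.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_clips(steps: Sequence[str], seed: int = 0) -> Tuple[List[str], List[int]]:
    """Shuffle procedural steps and return (shuffled_steps, answer).

    answer[k] is the 1-based position, within the shuffled list, of the
    k-th step of the original procedure. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    order = list(range(len(steps)))
    rng.shuffle(order)
    shuffled = [steps[i] for i in order]
    answer = [order.index(k) + 1 for k in range(len(steps))]
    return shuffled, answer


def exact_match(predicted: Sequence[int], answer: Sequence[int]) -> bool:
    """Assumed scoring rule: the whole predicted sequence must be correct."""
    return list(predicted) == list(answer)


def accuracy(predictions: Sequence[Sequence[int]], answers: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose predicted order exactly matches the answer."""
    if not answers:
        return 0.0
    hits = sum(exact_match(p, a) for p, a in zip(predictions, answers))
    return hits / len(answers)
```

Reading the shuffled clips in the order given by `answer` restores the original procedure, which is what a model's free-form ordering would be compared against under this assumed rule.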

Code & Models

Repositories

pritamqu/vcrbench (PyTorch, official)

Datasets

pritamqu/VCRBench (dataset, 45 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training