CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

Yuefei Chen; Jiang Liu; Xiaodong Lin; Ruixiang Tang

arXiv:2511.19923 · cs.CV · November 26, 2025

TL;DR

This paper introduces CounterVQA, a benchmark for evaluating counterfactual reasoning in video-language models, revealing current limitations and proposing CFGPT to improve reasoning capabilities through language-based distillation.

Contribution

The paper presents CounterVQA, a new benchmark for counterfactual reasoning in videos, and proposes CFGPT, a method to enhance models' reasoning by leveraging language modality.

Findings

01 Models perform poorly on complex counterfactual questions.

02 CounterVQA reveals significant gaps in current models' reasoning abilities.

03 CFGPT improves reasoning performance across all difficulty levels.

Abstract

Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)