VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models

Chenglin Li; Qianglong Chen; Zhi Li; Feng Tao; Yin Zhang

arXiv:2411.09105 · cs.CV · July 2, 2025

Open Access · 3 Reviews

TL;DR

VideoCogQA introduces a controllable synthetic video benchmark to evaluate the cognitive abilities of video-language models, revealing current models' limitations in handling abstract and symbolic tasks.

Contribution

The paper presents VideoCogQA, a novel synthetic video benchmark that enables fine-grained control over content and difficulty to assess cognitive skills in LVLMs.

Findings

01

State-of-the-art models score below 50% on abstract tasks.

02

Performance decreases by 15% with increased task complexity.

03

Current models struggle with symbolic and abstract reasoning in videos.

Abstract

Recent advancements in Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding. However, it remains unclear whether these models possess the cognitive capabilities required for high-level tasks, particularly those involving symbolic and abstract perception. Existing benchmarks typically rely on real-world, annotated videos, which lack control over video content and inherent difficulty, limiting their diagnostic power. To bridge this gap, we propose VideoCogQA, a scalable and fully controllable benchmark inspired by game-world environments, designed to evaluate the cognitive abilities of LVLMs. By generating synthetic videos via a programmatic engine, VideoCogQA allows fine-grained control over visual elements, temporal dynamics, and task difficulty. This approach enables a focused evaluation of video cognitive abilities, independent of prior…
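The idea of a programmatic engine with code-level difficulty control can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `simulate_scene`, `Event`, and `final_position_question`, the grid-world setting, and the question wording are all hypothetical, and rendering frames to video is omitted. It only shows how a seeded simulation can log events from which a ground-truth answer is derived, with `grid_size`, `num_objects`, and `num_frames` acting as difficulty knobs.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    """One logged state change: which object is in which cell at which frame."""
    frame: int
    obj_id: int
    cell: tuple  # (row, col)


def simulate_scene(grid_size, num_objects, num_frames, seed=0):
    """Move objects on a grid and log every placement (hypothetical toy engine)."""
    rng = random.Random(seed)  # seeded so the same parameters give the same scene
    positions = {
        i: (rng.randrange(grid_size), rng.randrange(grid_size))
        for i in range(num_objects)
    }
    events = [Event(0, i, pos) for i, pos in positions.items()]
    for frame in range(1, num_frames):
        obj = rng.randrange(num_objects)
        row, col = positions[obj]
        d_row, d_col = rng.choice([(0, 1), (1, 0), (0, -1), (-1, 0)])
        # Clamp to the grid so objects never leave the scene.
        new_cell = (
            min(max(row + d_row, 0), grid_size - 1),
            min(max(col + d_col, 0), grid_size - 1),
        )
        positions[obj] = new_cell
        events.append(Event(frame, obj, new_cell))
    return events


def final_position_question(events, obj_id):
    """Derive a question and its ground-truth answer from the event log."""
    last = [e for e in events if e.obj_id == obj_id][-1]
    return {
        "question": f"Which cell is object {obj_id} in at the end of the video?",
        "answer": last.cell,
    }
```

Because the answer is read from the log rather than annotated by hand, raising `grid_size` or `num_frames` changes difficulty without changing how ground truth is obtained.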

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The benchmark adds Game-environment and Full-modal, explicitly targeting symbolic/abstract attributes (size, color, shape) and temporal/spatial relations.
2. The authors programmatically synthesize videos with parameterized difficulty and log code-level events, then generate QA templates with GPT-4 and human filtering.

Weaknesses

1. Question templates originate from GPT-4 and are then filtered; more auditing of prompt templates and filtering criteria may strengthen validity claims and reproducibility.
2. The “~90% human” number isn’t well documented. We don’t know how many people were tested, how much time they had, whether they could replay the video, or how consistent the labels were. That makes the human ceiling hard to trust and compare against models.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper demonstrates that synthetic videos can be automatically generated from a game simulation engine, and that LLM-based instruction templates are created for each game according to predefined question categories. This approach enables dataset generation at scale, without being constrained by data size. To support this, the authors propose a Python-based video synthesis pipeline.
- The authors introduce VideoCogQA, a scalable and fully controllable benchmark. This benchmark is well-or

Weaknesses

- Lack of Details on Dataset Distribution
  - The paper does not provide sufficient details or analysis regarding the dataset distribution. It would be beneficial to include a detailed breakdown of the number of samples per category, organized by game and by difficulty level. The current explanation in Section 3.2 is largely textual and difficult to fully understand.
  - It would also be helpful to report how each game covers the different question categories, and how the VLM (Vision-Langu
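The per-game, per-difficulty breakdown requested above could be produced with a simple tally. The sketch below is only an illustration: the field names `game`, `difficulty`, and `category` and the toy records are hypothetical, not the benchmark's actual schema or data.

```python
from collections import Counter


def breakdown(samples, keys=("game", "difficulty", "category")):
    """Count samples for each combination of the given metadata fields."""
    return Counter(tuple(sample[key] for key in keys) for sample in samples)


# Hypothetical toy records; real entries would come from the benchmark metadata.
toy_samples = [
    {"game": "game_A", "difficulty": "easy", "category": "spatial"},
    {"game": "game_A", "difficulty": "easy", "category": "spatial"},
    {"game": "game_A", "difficulty": "hard", "category": "temporal"},
    {"game": "game_B", "difficulty": "easy", "category": "temporal"},
]

counts = breakdown(toy_samples)
per_game = breakdown(toy_samples, keys=("game",))
```

Tallying over a subset of keys, as in `per_game`, gives the marginal counts that would show how each game covers the question categories.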

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Controllability & difficulty. Clear, code-level knobs (e.g., grid size) allow precise difficulty control, improving diagnostic value.
2. Breadth of skills. Ten diverse scenes spanning object/action perception, spatial/temporal reasoning, game environment understanding, and audio-visual mapping.
3. Well-Documented Human–Model Gap. The paper clearly reports a substantial gap between human and model performance across all tasks and scenarios.

Weaknesses

1. Lack of random baseline. With 3–5 options, the performance of random choice can be 20–33%. This paper does not foreground a random baseline.
2. Lack of connection to Real-World Tasks. The paper does not extensively discuss the connection between VideoCogQA and real-world tasks. The current justification, based primarily on frame sampling, is insufficient. It remains unclear whether performance on specific VideoCogQA tasks correlates with performance on real-world tasks. Clarifying whether su
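The chance level mentioned in the first weakness follows directly from the number of answer options: uniform guessing among k options is correct with probability 1/k. The sketch below makes that arithmetic explicit and adds a common chance-corrected rescaling; the function names are hypothetical, and the rescaling is a standard convention assumed here, not something the paper or the review reports.

```python
def random_baseline(num_options):
    """Expected accuracy of uniform random guessing among num_options choices."""
    return 1.0 / num_options


def expected_chance(option_counts):
    """Average chance accuracy over questions with differing option counts."""
    return sum(random_baseline(k) for k in option_counts) / len(option_counts)


def chance_corrected(accuracy, num_options):
    """Rescale accuracy so chance maps to 0 and perfect performance to 1."""
    chance = random_baseline(num_options)
    return (accuracy - chance) / (1.0 - chance)
```

For example, `random_baseline(5)` is 0.2 and `random_baseline(3)` is about 0.333, matching the 20–33% range cited above, and a raw accuracy of 50% on four-option questions rescales to about 0.33 under this correction.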

Taxonomy

Topics: Psychiatry, Mental Health, Neuroscience