How Far Are Video Models from True Multimodal Reasoning?
Xiaotian Zhang, Jianhui Wei, Yuan Wang, Jie Tan, Yichen Li, Yan Zhang, Ziyi Chen, Daoan Zhang, Dezhi YU, Wei Xu, Songtao Jiang, Zuozhu Liu

TL;DR
This paper introduces CLVG-Bench, a comprehensive evaluation framework for assessing the true multimodal reasoning capabilities of video models, revealing significant gaps in current state-of-the-art models.
Contribution
The paper presents CLVG-Bench and the Adaptive Video Evaluator (AVE), novel tools for rigorous, scalable, and human-aligned assessment of video models' reasoning abilities across complex scenarios.
Findings
State-of-the-art models perform poorly on logical and interactive tasks (<25% success).
Current models excel only in basic understanding and reasoning subtasks.
The framework exposes critical bottlenecks in multimodal reasoning and physical grounding.
Abstract
Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert…
