How Far Are Video Models from True Multimodal Reasoning?

Xiaotian Zhang; Jianhui Wei; Yuan Wang; Jie Tan; Yichen Li; Yan Zhang; Ziyi Chen; Daoan Zhang; Dezhi YU; Wei Xu; Songtao Jiang; Zuozhu Liu

arXiv:2604.19193·cs.CV·April 22, 2026

TL;DR

This paper introduces CLVG-Bench, an evaluation framework that probes the zero-shot multimodal reasoning of video models through Context Learning in Video Generation. It finds that state-of-the-art models succeed on fewer than 25% of logical and interactive tasks.

Contribution

The paper presents two tools: CLVG-Bench, a manually annotated benchmark spanning 6 categories and 47 subcategories, and the Adaptive Video Evaluator (AVE), which supports rigorous, scalable, human-aligned assessment of video models' reasoning in complex scenarios.

Findings

1. State-of-the-art models perform poorly on logical and interactive tasks (<25% success).

2. Current models excel only in basic understanding and reasoning subtasks.

3. The framework exposes critical bottlenecks in multimodal reasoning and physical grounding.
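
The success figures above are per-task-group rates. As a rough illustration of that arithmetic only, the sketch below aggregates pass/fail verdicts into a success rate per category. The function name `success_rate_by_category`, the `(category, passed)` record shape, and the toy verdicts are all assumptions made for this example; they are not taken from CLVG-Bench or AVE, and the numbers are not results from the paper.

```python
from collections import defaultdict


def success_rate_by_category(results):
    """Return {category: fraction of samples judged successful}.

    `results` is an iterable of (category, passed) pairs. This record
    shape is hypothetical and is not the CLVG-Bench metadata format.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


# Made-up verdicts for illustration only; not results from the paper.
toy_results = [
    ("logical reasoning", False),
    ("logical reasoning", False),
    ("logical reasoning", True),
    ("logical reasoning", False),
    ("basic understanding", True),
    ("basic understanding", True),
]

rates = success_rate_by_category(toy_results)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```

Reporting one rate per category, rather than a single pooled score, is what lets a weak category stand out next to a strong one.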

Abstract

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert…

Code & Models

Repositories

GitHub (repository name not listed)

Datasets

Moncyan/CLVG-Bench (dataset, 52 downloads)