CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning
Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao

TL;DR
CVBench is a new benchmark designed to evaluate the ability of multimodal large language models to perform complex reasoning across multiple videos, highlighting current limitations and guiding future improvements.
Contribution
Introduces CVBench, the first diagnostic benchmark for cross-video relational reasoning, and provides a comprehensive evaluation of leading models that reveals significant performance gaps.
Findings
Top models achieve only 63.5% accuracy on causal reasoning tasks.
Current models show fundamental bottlenecks, such as poor inter-video context retention.
CVBench offers insights for developing next-generation multi-video reasoning models.
Abstract
Multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), but their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. This capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first diagnostic benchmark designed to rigorously assess cross-video relational reasoning. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports,…
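The three-tier structure described in the abstract lends itself to per-tier scoring. The following is a minimal sketch of how that could be done, not the authors' evaluation code: the `QAItem` record, its field names, the tier identifiers, and the exact-match scoring rule are all assumptions made for illustration, since the actual CVBench data format is not given here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Tier labels paraphrase the abstract; the identifiers are hypothetical.
TIERS = ("object_association", "event_association", "complex_reasoning")


@dataclass
class QAItem:
    """Hypothetical record for one cross-video question-answer pair."""
    video_ids: List[str]  # the videos the question spans (assumed field)
    tier: str             # one of TIERS
    question: str
    answer: str           # reference answer, compared by exact match (assumed)


def accuracy_by_tier(items: List[QAItem], predictions: List[str]) -> Dict[str, float]:
    """Return the fraction of exact-match predictions for each tier present."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.tier] += 1
        correct[item.tier] += int(pred == item.answer)
    return {tier: correct[tier] / total[tier] for tier in total}
```

Reporting accuracy per tier rather than as a single aggregate makes it possible to see whether a model struggles specifically with event-level or complex reasoning questions, which is the kind of gap the benchmark is described as diagnosing.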
