CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning

Nannan Zhu; Yonghao Dong; Teng Wang; Xueqian Li; Shengjun Deng; Yijia Wang; Zheng Hong; Tiantian Geng; Guo Niu; Hanyan Huang; Xiongfei Yao; Shuaiwei Jiao

arXiv:2508.19542 · cs.CV · January 7, 2026

TL;DR

CVBench is a new benchmark designed to evaluate the ability of multimodal large language models to perform complex reasoning across multiple videos, highlighting current limitations and guiding future improvements.

Contribution

Introduces CVBench, the first diagnostic benchmark for cross-video relational reasoning, and provides a comprehensive evaluation of leading models that reveals significant performance gaps.

Findings

01

Top models achieve only 63.5% accuracy on causal reasoning tasks.

02

Current models face fundamental bottlenecks, such as poor inter-video context retention.

03

CVBench offers insights for developing next-generation multi-video reasoning models.

Abstract

While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. Yet this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first diagnostic benchmark designed to rigorously assess cross-video relational reasoning. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports,…
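The three-tier structure described above lends itself to per-tier scoring. The following is a minimal sketch of how accuracy could be broken down by tier, not the authors' evaluation code. The record layout (`QAItem`), the tier identifiers, the function name `accuracy_by_tier`, and the assumption that each question has a single gold answer label are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass

# Tier names paraphrase the abstract; these identifiers are hypothetical.
TIERS = ("object_association", "event_association", "complex_reasoning")


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record for one cross-video question-answer pair."""
    question_id: str
    tier: str          # one of TIERS
    video_ids: tuple   # the videos the question spans (two or more)
    answer: str        # gold answer label (assumed format, e.g. "A")


def accuracy_by_tier(items, predictions):
    """Return {tier: accuracy} plus an "overall" entry.

    `predictions` maps question_id -> predicted label.
    A question with no prediction is counted as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.tier not in TIERS:
            raise ValueError(f"unknown tier: {item.tier!r}")
        total[item.tier] += 1
        if predictions.get(item.question_id) == item.answer:
            correct[item.tier] += 1
    scores = {tier: correct[tier] / total[tier] for tier in total}
    n = sum(total.values())
    scores["overall"] = sum(correct.values()) / n if n else 0.0
    return scores


# Toy data, invented for this sketch (not taken from CVBench).
items = [
    QAItem("q1", "object_association", ("v1", "v2"), "A"),
    QAItem("q2", "object_association", ("v1", "v3"), "B"),
    QAItem("q3", "event_association", ("v2", "v3"), "C"),
    QAItem("q4", "complex_reasoning", ("v1", "v2", "v3"), "D"),
]
predictions = {"q1": "A", "q2": "C", "q3": "C"}  # q4 left unanswered
scores = accuracy_by_tier(items, predictions)
```

Reporting each tier separately, rather than a single pooled number, makes it visible when a model handles entity matching across videos but fails on the reasoning-heavy tier.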
