CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

Jingyao Li; Jingyun Wang; Molin Tan; Haochen Wang; Cilin Yan; Likun Shi; Jiayin Cai; Xiaolong Jiang; Yao Hu

arXiv:2511.12263 · cs.CV · December 2, 2025


TL;DR

CrossVid is a benchmark for evaluating how well multimodal large language models reason across multiple videos, a capability that existing single-video benchmarks do not assess.

Contribution

We introduce CrossVid, the first benchmark with diverse hierarchical tasks and extensive video-question pairs to evaluate cross-video reasoning in multimodal models.

Findings

1. Gemini-2.5-Pro achieves 50.4% accuracy on CrossVid.

2. Most MLLMs struggle with evidence integration across videos.

3. CrossVid reveals limitations in current models' reasoning over multiple videos.

Abstract

Cross-Video Reasoning (CVR) presents a significant challenge in video understanding: it requires understanding multiple videos simultaneously in order to aggregate and compare information across groups of videos. Most existing video understanding benchmarks focus on single-video analysis and therefore fail to assess the ability of multimodal large language models (MLLMs) to reason over multiple videos at once. Recent benchmarks evaluate MLLMs' capabilities on multi-view videos that capture different perspectives of the same scene. However, their limited range of tasks hinders a thorough assessment of MLLMs in diverse real-world CVR scenarios. To this end, we introduce CrossVid, the first benchmark designed to comprehensively evaluate MLLMs' spatial-temporal reasoning ability in cross-video contexts. First, CrossVid encompasses a wide spectrum of hierarchical tasks, comprising four high-level dimensions…


Code & Models

Datasets

Chuntianli/CrossVid
dataset · 7.7k downloads

Videos

CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks