MINERVA: Evaluating Complex Video Reasoning

Arsha Nagrani; Sachit Menon; Ahmet Iscen; Shyamal Buch; Ramin Mehran; Nilpa Jha; Anja Hauth; Yukun Zhu; Carl Vondrick; Mikhail Sirotenko; Cordelia Schmid; Tobias Weyand

arXiv:2505.00681 · cs.LG · May 2, 2025

TL;DR

MINERVA introduces a new video reasoning dataset with detailed reasoning traces to evaluate and analyze the reasoning capabilities of multimodal models, revealing common failure modes and advancing understanding of video comprehension.

Contribution

The paper presents MINERVA, a comprehensive video reasoning dataset with annotated reasoning traces, enabling detailed evaluation and analysis of multimodal models' reasoning abilities.

Findings

01 Models struggle with temporal localization.
02 Visual perception errors are common.
03 Logical errors are less frequent.

Abstract

Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess whether models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common…
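
The abstract describes each question as carrying 5 answer choices plus a hand-crafted reasoning trace. A minimal sketch of how such a record and its outcome-level scoring might look is given below. The class name `VideoQuestion`, its field names, and the helper functions are hypothetical and chosen for illustration only; they are not the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence

# The abstract states that every question has five answer choices.
NUM_CHOICES = 5


@dataclass(frozen=True)
class VideoQuestion:
    """Hypothetical record layout; field names are illustrative assumptions."""

    video_id: str
    question: str
    choices: tuple[str, ...]
    answer_index: int      # position of the correct choice, 0-based
    reasoning_trace: str   # free-text intermediate reasoning steps

    def __post_init__(self) -> None:
        # Reject records that do not match the five-choice format.
        if len(self.choices) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} choices, got {len(self.choices)}")
        if not 0 <= self.answer_index < NUM_CHOICES:
            raise ValueError("answer_index is out of range")


def accuracy(items: Sequence[VideoQuestion], predictions: Sequence[int]) -> float:
    """Fraction of questions whose predicted choice index matches the answer."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer_index)
    return correct / len(items)


def chance_accuracy() -> float:
    """Expected accuracy of uniform random guessing over the answer choices."""
    return 1.0 / NUM_CHOICES
```

With five choices, uniform guessing scores 20% in expectation, so outcome accuracy alone cannot show whether a correct answer came from sound reasoning. That is the gap the reasoning traces are meant to address.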

Code & Models

Repositories

google-deepmind/neptune (Official)
