Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Arsha Nagrani; Jasper Uijilings; Shyamal Buch; Tobias Weyand; Sudheendra Vijayanarasimhan; Bo Hu; Ramin Mehran; David A Ross; Cordelia Schmid

arXiv:2605.15342·cs.CV·May 18, 2026

TL;DR

Minerva-Ego introduces a new benchmark for egocentric video reasoning that includes multi-step questions and reasoning traces, revealing significant gaps in current models and showing that targeted hints improve performance.

Contribution

The paper presents Minerva-Ego, a benchmark for egocentric video understanding that pairs multi-step multimodal questions with human-annotated spatiotemporal reasoning traces, and shows that prompting models with spatiotemporal hints improves their performance.

Findings

01 State-of-the-art models lag behind human performance.

02 Spatiotemporal mask annotations help analyze reasoning gaps.

03 Hint-based prompting significantly improves model accuracy.
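
As a rough illustration of what hint-based prompting could look like, the sketch below attaches time intervals and object locations to a question as plain text. It is not the paper's implementation: the `Hint` record, its field names, the `build_hinted_prompt` helper, and the prompt wording are all hypothetical, and a bounding box is used here as a coarse stand-in for the spatiotemporal masks the benchmark annotates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hint:
    """Hypothetical hint record; field names are assumptions, not from the paper."""
    label: str                              # object of interest, e.g. "kettle"
    start_s: float                          # start of the relevant interval, in seconds
    end_s: float                            # end of the relevant interval, in seconds
    box: Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1); stand-in for a mask


def build_hinted_prompt(question: str, hints: List[Hint]) -> str:
    """Append spatiotemporal hints to a question, ordered by start time."""
    lines = [question, "Hints:"]
    for h in sorted(hints, key=lambda h: h.start_s):
        x0, y0, x1, y1 = h.box
        lines.append(
            f"- {h.label}: {h.start_s:.1f}s to {h.end_s:.1f}s, "
            f"box ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})"
        )
    return "\n".join(lines)


# Example with made-up values, for illustration only.
example_hints = [
    Hint("mug", 12.0, 15.5, (0.10, 0.20, 0.30, 0.40)),
    Hint("kettle", 3.0, 6.0, (0.50, 0.50, 0.80, 0.90)),
]
example_prompt = build_hinted_prompt("Which object was used first?", example_hints)
```

Under these assumptions, the resulting text would be passed to a video model alongside the frames. Comparing accuracy with and without the hint lines is one way to probe whether errors come from locating the relevant evidence or from reasoning over it.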

Abstract

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through…

Code & Models

Repositories

google-deepmind/neptune (GitHub)
