TL;DR
Minerva-Ego is a new benchmark for egocentric video reasoning with multi-step questions and human-annotated reasoning traces. It reveals a large gap between current models and human performance and shows that targeted hints improve model performance.
Contribution
The paper presents Minerva-Ego, a benchmark for egocentric video understanding that pairs challenging multi-step, multimodal questions with spatiotemporally dense, human-annotated reasoning traces, and demonstrates that prompting models with spatiotemporal hints improves their performance.
Findings
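State-of-the-art models still fall well short of human performance on the benchmark.
Spatiotemporal mask annotations of the objects of interest in each reasoning trace make it possible to analyze this gap in detail.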
Hint-based prompting significantly improves model accuracy.
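To make the hint-based finding concrete, here is a minimal sketch of how object-of-interest annotations could be turned into a textual hint in a prompt. Everything in it is an assumption for illustration: the `ObjectTrack` and `Question` records, the `build_prompt` helper, the use of a bounding box as a coarse stand-in for a mask, and the example question are all hypothetical and are not taken from the paper or its data format.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectTrack:
    """Hypothetical record: one object of interest and when/where it appears."""
    label: str
    start_s: float  # start of the relevant time span, in seconds
    end_s: float    # end of the relevant time span, in seconds
    # Box (x0, y0, x1, y1) in normalized [0, 1] coordinates; a coarse
    # stand-in for a spatiotemporal mask (assumption, not the paper's format).
    box: tuple[float, float, float, float]


@dataclass
class Question:
    """Hypothetical record: a question plus its annotated objects of interest."""
    text: str
    tracks: list[ObjectTrack] = field(default_factory=list)


def build_prompt(q: Question, with_hints: bool) -> str:
    """Build a text prompt, optionally appending spatiotemporal hints."""
    lines = [f"Question: {q.text}"]
    if with_hints and q.tracks:
        lines.append("Hints (objects of interest):")
        for t in q.tracks:
            x0, y0, x1, y1 = t.box
            lines.append(
                f"- {t.label}: {t.start_s:.1f}s to {t.end_s:.1f}s, "
                f"box=({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})"
            )
    lines.append("Answer:")
    return "\n".join(lines)


# Made-up example question, used only to exercise the sketch.
example = Question(
    "Where was the mug placed after washing?",
    [ObjectTrack("mug", 12.0, 15.5, (0.1, 0.2, 0.3, 0.4))],
)
print(build_prompt(example, with_hints=True))
```

Comparing a model's answers with `with_hints=True` and `with_hints=False` on the same questions is one simple way to measure how much such hints help under this assumed setup.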
Abstract
Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through…
