EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Chengjun Yu; Xuhan Zhu; Chaoqun Du; Pengfei Yu; Wei Zhai; Yang Cao; Zheng-Jun Zha

arXiv:2603.09731·cs.CV·March 13, 2026

Open Access

TL;DR

This paper introduces EXPLORE-Bench, a new benchmark for evaluating multimodal large language models on their ability to perform long-horizon egocentric scene prediction, revealing significant gaps compared to human reasoning.

Contribution

The paper presents a novel benchmark, EXPLORE-Bench, for systematic evaluation of long-term egocentric scene prediction by multimodal models, with detailed annotations and diverse real-world scenarios.

Findings

01. Models lag behind humans in long-horizon egocentric reasoning.

02. Decomposing action sequences can improve model performance.

03. Stepwise reasoning incurs computational overhead.

Abstract

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source…
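To make the task setup concrete, here is a minimal sketch of how one benchmark instance and a fine-grained comparison might be represented. Everything in it is an illustrative assumption rather than the benchmark's actual schema or metric: the class names (`Relation`, `SceneAnnotation`, `Instance`), the field names, and the `set_f1` scoring helper are hypothetical, chosen only to mirror the three annotation types the abstract names (object categories, visual attributes, inter-object relations).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Relation:
    """Hypothetical inter-object relation, e.g. (cup, on, table)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneAnnotation:
    """Hypothetical structured description of a final scene."""
    objects: Set[str] = field(default_factory=set)                # object categories
    attributes: Dict[str, Set[str]] = field(default_factory=dict)  # object -> visual attributes
    relations: Set[Relation] = field(default_factory=set)         # inter-object relations


@dataclass
class Instance:
    """Hypothetical benchmark instance: initial image, actions, final scene."""
    initial_image_path: str
    actions: List[str]            # atomic action descriptions, in execution order
    final_scene: SceneAnnotation  # reference annotation of the scene after all actions


def set_f1(predicted: set, reference: set) -> float:
    """Set-overlap F1; an assumed stand-in metric, not the paper's."""
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up content.
reference = SceneAnnotation(
    objects={"cup", "table", "kettle"},
    relations={Relation("cup", "on", "table")},
)
predicted = SceneAnnotation(
    objects={"cup", "table", "plate"},
    relations={Relation("cup", "on", "table")},
)
object_score = set_f1(predicted.objects, reference.objects)
relation_score = set_f1(predicted.relations, reference.relations)
```

Scoring each annotation type separately, as in this sketch, is one way a structured annotation can support the per-aspect assessment the abstract describes; the benchmark's real scoring procedure may differ.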

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition