EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Yuqian Yuan; Ronghao Dang; Long Li; Wentong Li; Dian Jiao; Xin Li; Deli Zhao; Fan Wang; Wenqiao Zhang; Jun Xiao; Yueting Zhuang

arXiv:2506.05287 · cs.CV · June 6, 2025

TL;DR

EOC-Bench is a new benchmark for evaluating how well multimodal large language models (MLLMs) understand, recall, and predict objects in dynamic egocentric environments, a capability that existing benchmarks focused on static scenes do not assess.

Contribution

The paper introduces EOC-Bench, a comprehensive benchmark with annotated QA pairs and a novel temporal accuracy metric for assessing object-centric cognition in dynamic egocentric scenarios.

Findings

01. MLLMs show varied performance across the temporal categories.

02. EOC-Bench enables systematic evaluation of object understanding in dynamic contexts.

03. The benchmark facilitates the development of more reliable embodied cognition models.

Abstract

The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure…
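
To make the Past/Present/Future split concrete, here is a minimal sketch of how results on such a benchmark might be broken down by temporal category. It assumes a hypothetical record layout (`QAPair` and its field names are illustrative, not the schema of the released dataset), uses made-up toy questions, and scores with plain case-insensitive exact match as a stand-in. It does not implement the paper's temporal accuracy metric, which this page does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """Hypothetical QA record; field names are assumptions for illustration."""
    question: str
    answer: str
    temporal_category: str  # "Past", "Present", or "Future"


def accuracy_by_category(pairs, predictions):
    """Return exact-match accuracy for each temporal category."""
    if len(pairs) != len(predictions):
        raise ValueError("pairs and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, prediction in zip(pairs, predictions):
        total[pair.temporal_category] += 1
        # Case-insensitive exact match; a stand-in for the paper's metrics.
        if prediction.strip().lower() == pair.answer.strip().lower():
            correct[pair.temporal_category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy data invented for this sketch; not taken from EOC-Bench.
pairs = [
    QAPair("Where was the cup before it was moved?", "on the shelf", "Past"),
    QAPair("Was the drawer opened earlier?", "yes", "Past"),
    QAPair("What is the state of the lamp?", "on", "Present"),
    QAPair("What will happen if the glass is pushed?", "it will fall", "Future"),
]
predictions = ["On the shelf", "no", "on", "nothing"]

scores = accuracy_by_category(pairs, predictions)
print(scores)
```

Reporting one score per category, rather than a single aggregate, is what allows the kind of comparison described in the first finding, where model performance differs across temporal categories.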

Code & Models

Datasets

CircleRadon/EOC-Bench · dataset · 42 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Social Robot Interaction and HRI
