EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang

TL;DR
EOC-Bench is a new benchmark for evaluating multimodal large language models' ability to understand, recall, and predict objects in dynamic egocentric environments, addressing a gap in existing static scene-focused benchmarks.
Contribution
The paper introduces EOC-Bench, a comprehensive benchmark with annotated QA pairs and a novel temporal accuracy metric for assessing object-centric cognition in dynamic egocentric scenarios.
Findings
MLLMs perform unevenly across the benchmark's three temporal categories (Past, Present, and Future).
EOC-Bench enables systematic evaluation of object understanding in dynamic contexts.
The benchmark facilitates the development of more reliable embodied cognition models.
Abstract
The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure…
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Social Robot Interaction and HRI
