ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Ronghao Dang; Yuqian Yuan; Wenqi Zhang; Yifei Xin; Boqiang Zhang; Long Li; Liuyi Wang; Qinyang Zeng; Xin Li; Lidong Bing

arXiv:2501.05031 · cs.CV · March 14, 2025

TL;DR

This paper introduces ECBench, a comprehensive benchmark for evaluating the embodied cognitive abilities of large vision-language models in egocentric video understanding, addressing current evaluation gaps.

Contribution

It presents ECBench, a systematic and high-quality benchmark with diverse data and evaluation metrics, to assess embodied cognition in LVLMs.

Findings

01

ECBench enables detailed evaluation of LVLMs' embodied cognition.

02

Proprietary and open-source LVLMs show varied performance on ECBench.

03

The benchmark highlights key challenges like scene perception and hallucination in embodied models.

Abstract

The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question…

Code & Models

Repositories

rh-dang/ecbench (Official)

Taxonomy

Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation