EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

Zhili Cheng; Yuge Tu; Ran Li; Shiqi Dai; Jinyi Hu; Shengding Hu; Jiahao Li; Yang Shi; Tianyu Yu; Weize Chen; Lei Shi; Maosong Sun

arXiv:2501.11858·cs.CV·April 14, 2025

TL;DR

EmbodiedEval is a comprehensive benchmark with 328 diverse embodied tasks in 3D environments designed to evaluate multimodal large language models' embodied capabilities, revealing significant gaps compared to human performance.

Contribution

We introduce EmbodiedEval, a new interactive benchmark with diverse tasks and scenes to evaluate MLLMs' embodied skills, addressing limitations of previous static and task-specific benchmarks.

Findings

01

State-of-the-art MLLMs perform significantly below humans on embodied tasks.

02

EmbodiedEval covers five task categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering.

03

The benchmark reveals current MLLMs' limitations in embodied AI capabilities.
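
As a rough illustration of how results over the five categories listed in the findings might be summarized, the sketch below tallies a per-category success rate. It assumes each task episode ends in a binary success or failure; that metric, the function name `success_rate_by_category`, and the example outcomes are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict

# The five task categories named in the findings above.
CATEGORIES = (
    "navigation",
    "object interaction",
    "social interaction",
    "attribute question answering",
    "spatial question answering",
)


def success_rate_by_category(results):
    """Summarize (category, succeeded) pairs, one pair per task episode.

    Returns {category: fraction of episodes that succeeded}, listing only
    categories that have at least one result. Hypothetical helper, not part
    of the EmbodiedEval codebase.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for category, succeeded in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        totals[category] += 1
        wins[category] += int(bool(succeeded))
    return {c: wins[c] / totals[c] for c in CATEGORIES if totals[c]}


# Made-up outcomes for illustration only; these are not results from the paper.
example = [
    ("navigation", True),
    ("navigation", False),
    ("spatial question answering", True),
]
rates = success_rate_by_category(example)
```

Reporting one rate per category, rather than a single pooled number, keeps a weak category from being hidden by a strong one.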

Abstract

Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily use static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and insufficiently diverse, so they do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation,…

Code & Models

Repositories

thunlp/embodiedeval
Official

Datasets

EmbodiedEval/EmbodiedEval
dataset · 65 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems