EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun

TL;DR
EmbodiedEval is a comprehensive benchmark with 328 diverse embodied tasks in 3D environments designed to evaluate multimodal large language models' embodied capabilities, revealing significant gaps compared to human performance.
Contribution
We introduce EmbodiedEval, a new interactive benchmark with diverse tasks and scenes to evaluate MLLMs' embodied skills, addressing limitations of previous static and task-specific benchmarks.
Findings
State-of-the-art MLLMs perform significantly below humans on embodied tasks.
EmbodiedEval covers five task categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering.
The benchmark reveals current MLLMs' limitations in embodied AI capabilities.
Abstract
Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and insufficiently diverse, so they do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation,…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
