Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu

TL;DR
This paper introduces a comprehensive benchmark and interface for evaluating Large Language Models in embodied decision-making tasks, addressing previous evaluation limitations and providing detailed insights into LLM capabilities and weaknesses.
Contribution
It proposes a unified interface and a set of fine-grained metrics to systematically assess LLMs across various embodied decision-making tasks and modules.
Findings
Identifies specific error types like hallucinations and planning errors.
Provides detailed performance breakdowns of LLMs in embodied tasks.
Highlights strengths and weaknesses of LLMs in decision-making contexts.
Abstract
We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
