Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Manling Li; Shiyu Zhao; Qineng Wang; Kangrui Wang; Yu Zhou; Sanjana Srivastava; Cem Gokmen; Tony Lee; Li Erran Li; Ruohan Zhang; Weiyu Liu; Percy Liang; Li Fei-Fei; Jiayuan Mao; Jiajun Wu

arXiv:2410.07166 · cs.CL · January 22, 2025 · 3 cites

TL;DR

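This paper introduces a benchmark and a generalized interface (Embodied Agent Interface) for evaluating Large Language Models (LLMs) in embodied decision making. It targets two gaps in earlier evaluations: LLMs are applied in different domains with different inputs and outputs, and results are usually reported only as a final success rate, which hides which ability is missing.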

Contribution

The paper proposes a unified interface and a set of fine-grained metrics for systematically assessing LLMs across embodied decision-making tasks and LLM-based modules.

Findings

01 Identifies specific error types like hallucinations and planning errors.

02 Provides detailed performance breakdowns of LLMs in embodied tasks.

03 Highlights strengths and weaknesses of LLMs in decision-making contexts.
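
To make the contrast between a single success rate and a per-error-type breakdown concrete, here is a minimal sketch. It is not the benchmark's code: the `EpisodeResult` record, the `summarize` helper, and the error labels are hypothetical names chosen for illustration, and the sample data is invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeResult:
    """One evaluated task (hypothetical schema, not the benchmark's)."""
    task_id: str
    success: bool
    error_type: Optional[str] = None  # e.g. "hallucination" or "planning"


def summarize(results):
    """Return the overall success rate plus the share of tasks per error type."""
    total = len(results)
    if total == 0:
        return {"success_rate": 0.0, "error_rates": {}}
    successes = sum(r.success for r in results)
    # Count only failed episodes that carry an error label.
    errors = Counter(
        r.error_type for r in results if not r.success and r.error_type
    )
    return {
        "success_rate": successes / total,
        "error_rates": {name: count / total for name, count in errors.items()},
    }


# Invented sample data: one success, three failures.
results = [
    EpisodeResult("t1", True),
    EpisodeResult("t2", False, "hallucination"),
    EpisodeResult("t3", False, "planning"),
    EpisodeResult("t4", False, "planning"),
]
summary = summarize(results)
print(summary)
```

A bare success rate would report only that one task in four succeeded; the breakdown also shows that, in this invented sample, planning errors account for most of the failures.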

Abstract

We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied…

Videos

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Sparse Evolutionary Training