Robotouille: An Asynchronous Planning Benchmark for LLM Agents

Gonzalo Gonzalez-Pumariega; Leong Su Yean; Neha Sunkara; Sanjiban Choudhury

arXiv:2502.05227 · cs.RO · February 11, 2025

TL;DR

Robotouille introduces a new benchmark environment to evaluate large language model agents' ability to perform complex, long-horizon asynchronous planning tasks, revealing current limitations and areas for improvement.

Contribution

The paper presents Robotouille, a novel asynchronous planning benchmark for LLM agents, with datasets capturing complex scenarios beyond existing short-horizon benchmarks.

Findings

01. ReAct (gpt4-o) achieves 47% on synchronous tasks.
02. ReAct (gpt4-o) achieves 11% on asynchronous tasks.
03. Analysis highlights the need for better long-horizon feedback incorporation.

Abstract

Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt4-o) achieves 47% on synchronous tasks but…
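
To make the synchronous/asynchronous distinction in the abstract concrete, here is a minimal toy sketch. It is not Robotouille's API or the authors' method: the `Task` class, the task names, and the step counts are all hypothetical. It compares an agent that idles through every delay with one that starts long-delay tasks first and does other work while they run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical toy model, not taken from the Robotouille codebase.
    name: str
    hands_on: int  # steps during which the agent itself is busy
    wait: int      # steps the task then runs unattended (e.g. cooking)


def synchronous_makespan(tasks):
    """Total steps if the agent idles through every delay before moving on."""
    return sum(t.hands_on + t.wait for t in tasks)


def asynchronous_makespan(tasks):
    """Total steps if the agent starts the longest-delay tasks first
    and performs other hands-on work while those delays elapse."""
    clock = 0   # time at which the agent is next free
    finish = 0  # latest completion time seen so far
    for t in sorted(tasks, key=lambda t: t.wait, reverse=True):
        clock += t.hands_on                    # agent is busy starting this task
        finish = max(finish, clock + t.wait)   # task completes after its delay
    return finish


# Illustrative values only.
tasks = [
    Task("cook patty", hands_on=1, wait=3),
    Task("cut lettuce", hands_on=3, wait=0),
    Task("boil soup", hands_on=1, wait=5),
]

print(synchronous_makespan(tasks))   # 13
print(asynchronous_makespan(tasks))  # 6
```

In this toy setting the interleaved plan finishes in fewer than half the steps, which is the kind of gain an agent forgoes when it cannot reason about overlapping delays.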

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 3 · Confidence 4

Strengths

- The paper proposes a benchmark for asynchronous planning based on an environment similar to Overcooked.
- The paper demonstrates that many existing LLMs fall short on planning problems that require considering the time delays of sub-tasks.

Weaknesses

- While this paper proposes a new benchmark for certain types of LLM reasoning tasks, it does not include sufficient experimental evaluation and analysis to highlight its main challenges. Specifically, although many LLMs and LLM-based planning algorithms exist for tackling planning problems, the paper experiments with only one LLM and two planning algorithms (CoT and ReAct). More baselines are needed to better illustrate the challenges posed by the proposed benchmark.
- It is unclear from

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. There is a notable gap in benchmarks for planning and decision-making tasks that involve asynchronous, long-horizon scenarios. The authors effectively compare their work to relevant literature, underscoring the significance of ROBOTOUILLE. Moreover, the cooking aspect of the tasks makes them relatable and easy to understand.
2. By conducting experiments across two different settings, the authors identify key reasons for failure modes in the asynchronous decision-making context, such as the i

Weaknesses

1. The paper lacks a discussion on prompt design. The performance differences between the synchronous and asynchronous datasets may stem from variations in prompts, which often determine the practical limits of an agent's capabilities.
2. Additionally, the tasks in the synchronous and asynchronous datasets differ, making it inappropriate to directly compare results from both settings during the analysis.
3. If possible, I suggest including experimental results from some open-source models in t

Reviewer 03 · Rating 6 · Confidence 4

Strengths

+ Well-written with clear illustrations of domain complexities.
+ Rigorous experimental design, featuring carefully curated datasets across multiple complexity levels and a detailed failure analysis.

Weaknesses

* The paper can be split into two main sections: 1) introducing and baselining the benchmark and 2) analyzing results. While the first part is strong, the insights in the second part are somewhat obscured. For instance, in Q6, the majority of both successful (72.7%) and failed (52.8%) trajectories prioritized subtask completion. Does this indicate that subtask prioritization may not be a critical area for improvement in planning? I encourage the authors to capitalize more on their analysis.
* The absen

Code & Models

Repositories

portal-cornell/robotouille · Official

Videos

Robotouille: An Asynchronous Planning Benchmark for LLM Agents · SlidesLive

Taxonomy

Topics: Modular Robots and Swarm Intelligence · Optimization and Search Problems · Robotic Path Planning Algorithms

Methods: Focus