On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

Karthik Valmeekam; Sarath Sreedharan; Matthew Marquez; Alberto Olmo; Subbarao Kambhampati

arXiv:2302.06706·cs.AI·February 15, 2023·31 cites

Open Access

TL;DR

This paper critically examines the planning abilities of large language models by developing a benchmark and evaluating their performance in autonomous, heuristic, and human-in-the-loop modes, revealing limited autonomous planning success.

Contribution

The paper introduces a new benchmark suite for evaluating LLMs in planning tasks and systematically assesses their planning capabilities across different modes.

Findings

01  LLMs autonomously generate executable plans in only about 3% of cases on average.

02  The heuristic and human-in-the-loop modes perform slightly better than the autonomous mode.

03  The benchmark and evaluation tools are publicly available.

Abstract

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves at generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are at being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging…
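To make "executable plan" concrete, here is a minimal sketch of how a candidate plan could be checked against a STRIPS-style domain. It is not the paper's evaluation code. The `Action` class, the `validate_plan` and `success_rate` helpers, and the two-block toy problem are all hypothetical names and data introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (hypothetical structure for illustration)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Simulate the plan; return (executable, reaches_goal)."""
    state = set(initial_state)
    for action in plan:
        # A plan is not executable if any step's preconditions do not hold.
        if not action.preconditions <= state:
            return False, False
        state = (state - action.delete_effects) | action.add_effects
    return True, goal <= state


def success_rate(outcomes):
    """Fraction of True values in a list of per-instance outcomes."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy two-block problem (assumed data, not taken from the benchmark).
init = frozenset({"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "handempty"})
goal = frozenset({"on(A,B)"})

pickup_a = Action(
    name="pickup(A)",
    preconditions=frozenset({"ontable(A)", "clear(A)", "handempty"}),
    add_effects=frozenset({"holding(A)"}),
    delete_effects=frozenset({"ontable(A)", "clear(A)", "handempty"}),
)
stack_a_b = Action(
    name="stack(A,B)",
    preconditions=frozenset({"holding(A)", "clear(B)"}),
    add_effects=frozenset({"on(A,B)", "clear(A)", "handempty"}),
    delete_effects=frozenset({"holding(A)", "clear(B)"}),
)

candidate_plans = [
    [pickup_a, stack_a_b],  # executable and reaches the goal
    [stack_a_b],            # not executable: nothing is being held
    [pickup_a],             # executable but stops short of the goal
]
outcomes = [all(validate_plan(init, goal, p)) for p in candidate_plans]
```

In this sketch a plan counts as a success only when every step is applicable and the final state satisfies the goal. Averaging such outcomes over many instances gives a success rate of the kind the abstract reports.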

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems