On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)
Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, Subbarao Kambhampati

TL;DR
This paper critically examines the planning abilities of large language models by developing a benchmark and evaluating their performance in autonomous, heuristic, and human-in-the-loop modes, revealing limited autonomous planning success.
Contribution
The paper introduces a new benchmark suite for evaluating LLMs in planning tasks and systematically assesses their planning capabilities across different modes.
Findings
In autonomous mode, LLMs generate executable plans with an average success rate of only about 3%.
The heuristic and human-in-the-loop modes perform slightly better than the autonomous mode.
Benchmark and evaluation tools are publicly available.
Abstract
Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves at generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are at being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
