LLMs Can Plan Only If We Tell Them

Bilgehan Sel; Ruoxi Jia; Ming Jin

arXiv:2501.13545·cs.CL·January 24, 2025

Open Access · 3 Reviews

TL;DR

This paper shows that large language models can generate effective long-horizon plans autonomously when prompted with AoT+, an enhanced version of Algorithm-of-Thoughts, outperforming prior methods and human baselines on planning benchmarks.

Contribution

The authors introduce AoT+, a set of enhancements to Algorithm-of-Thoughts (AoT) that enables LLMs to independently produce competitive long-horizon plans without external feedback.

Findings

01 Achieved state-of-the-art results in planning benchmarks

02 Outperformed prior methods and human baselines

03 Enabled autonomous long-horizon planning with LLMs

Abstract

Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements to Algorithm-of-Thoughts (AoT), which we dub AoT+, help achieve state-of-the-art results in planning benchmarks out-competing prior…
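For readers unfamiliar with the Blocksworld benchmark named in the abstract, the following is a minimal sketch of what checking a candidate plan involves. It assumes the standard four-action formulation (pick-up, put-down, stack, unstack) with a single arm. The names `apply_action` and `validate_plan` and the state encoding are hypothetical choices for illustration; they are not taken from the paper and do not represent AoT+ itself.

```python
from typing import Dict, List, Optional, Tuple

# Assumed encoding (illustrative only): `on` maps each block to what it rests
# on ("table" or another block); `holding` is the block in the arm, or None.
State = Tuple[Dict[str, str], Optional[str]]
Action = Tuple[str, ...]


def is_clear(on: Dict[str, str], block: str) -> bool:
    """A block is clear when no other block rests on it."""
    return block not in on.values()


def apply_action(state: State, action: Action) -> Optional[State]:
    """Return the successor state, or None if the action is not applicable."""
    on, holding = dict(state[0]), state[1]
    name, args = action[0], action[1:]
    if name == "pick-up":      # lift a clear block off the table
        (b,) = args
        if holding is None and on.get(b) == "table" and is_clear(on, b):
            del on[b]
            return on, b
    elif name == "put-down":   # place the held block on the table
        (b,) = args
        if holding == b:
            on[b] = "table"
            return on, None
    elif name == "unstack":    # lift a clear block off another block
        b, below = args
        if holding is None and on.get(b) == below and is_clear(on, b):
            del on[b]
            return on, b
    elif name == "stack":      # place the held block on a clear block
        b, below = args
        if holding == b and below in on and is_clear(on, below):
            on[b] = below
            return on, None
    return None


def validate_plan(initial: Dict[str, str], plan: List[Action],
                  goal: Dict[str, str]) -> bool:
    """Simulate the plan step by step and check the goal relations at the end."""
    state: Optional[State] = (dict(initial), None)
    for action in plan:
        state = apply_action(state, action)
        if state is None:      # an inapplicable action invalidates the plan
            return False
    on, _ = state
    return all(on.get(block) == under for block, under in goal.items())


# Example: C sits on A; A and B are on the table. Goal: A on B, B on C.
initial = {"A": "table", "B": "table", "C": "A"}
goal = {"A": "B", "B": "C"}
good_plan = [("unstack", "C", "A"), ("put-down", "C"), ("pick-up", "B"),
             ("stack", "B", "C"), ("pick-up", "A"), ("stack", "A", "B")]
bad_plan = [("pick-up", "A")]  # A is not clear because C rests on it
```

Under these assumptions, `validate_plan(initial, good_plan, goal)` accepts the six-step plan, while `bad_plan` is rejected at its first action. A checker of this kind plays the role of the external verifier that feedback-based approaches rely on; the abstract's question is whether an LLM can produce valid plans without one.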

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper presents a novel perspective that challenges both overly pessimistic and overly optimistic views of LLMs' planning capabilities. The AoT+ innovations combine existing ideas creatively, using periodic state regeneration to manage attention/cognitive load. The empirical evaluation is comprehensive, covering multiple challenging benchmarks and including a clear ablation of components through the comparison of AoT vs AoT+. The paper has a well-structured progression of ideas from problem motivation

Weaknesses

The paper focuses heavily on successful cases but lacks a systematic analysis of where AoT+ fails. While the paper compares AoT vs AoT+, it doesn't fully isolate the impact of each innovation. AoT+ assumes that a PDDL instance of the problem is available, so I'm not sure whether the method scales to general domains.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Although the work builds on existing AoT work, it still shows several strengths, including:
- Achieving state-of-the-art performance across complex planning benchmarks without the need for external verification tools.
- Unlike approaches like Tree-of-Thought (ToT) that require extensive API requests and computational resources, AoT+ operates efficiently within a single-prompt framework, cutting down on token usage and latency. This improvement is also observed in AoT but token counts in

Weaknesses

The authors have done great work and could improve the paper further by addressing the following:
- In terms of presentation, I would expect a clearer diagram explaining the different stages of the proposed method. It took me some time to get a better sense of the proposed method by going through the details in the methodology section.
- While AoT+ performs well on the benchmarks reported in the paper, evaluation on real-world planning tasks like pathfinding for robotics would strengthen the work.
- Wh

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- It presents a significant advancement in the autonomous planning capabilities of LLMs, demonstrating their potential to match or exceed human performance.
- This paper proposes a prompting strategy to generate long-horizon plans.

Weaknesses

- First, the performance gap you identify has already been established [1].
- However, several key baselines are missing. Although significant research addresses planning optimization strategies, much of it does not conduct experiments in the Blocksworld domain [2-4]. Furthermore, baseline [5], which does operate in Blocksworld, has not been directly compared.
- Given that your method relies on search-based techniques, it would be beneficial to include comparisons with MCTS-Decoding or

