LLMs Can Plan Only If We Tell Them
Bilgehan Sel, Ruoxi Jia, Ming Jin

TL;DR
This paper demonstrates that large language models can generate effective long-horizon plans autonomously when enhanced with specific algorithmic improvements, surpassing previous methods and human performance on planning benchmarks.
Contribution
The authors introduce AoT+ enhancements to Algorithm-of-Thoughts, enabling LLMs to independently produce competitive long-term plans without external feedback.
Findings
Achieved state-of-the-art results in planning benchmarks
Outperformed prior methods and human baselines
Enabled autonomous long-horizon planning with LLMs
Abstract
Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements to Algorithm-of-Thoughts (AoT), which we dub AoT+, help achieve state-of-the-art results in planning benchmarks out-competing prior…
Peer Reviews
Decision: ICLR 2025 Poster
The paper presents a novel perspective that challenges both overly pessimistic and overly optimistic views of LLMs' planning capabilities. The AoT+ innovations are creative combinations of existing ideas, in that they use periodic state regeneration to manage attention/cognitive load. The empirical evaluation is comprehensive, covering multiple challenging benchmarks and including a clear ablation of components through the comparison of AoT vs. AoT+. The paper has a well-structured progression of ideas from problem motivation
The paper focuses heavily on successful cases but lacks a systematic analysis of where AoT+ fails. While the paper compares AoT vs. AoT+, it doesn't fully isolate the impact of each innovation. AoT+ assumes we have a PDDL instance of the problem, so I'm not sure whether this method scales to general domains.
Although the work has been built on existing AoT work, it still shows several strengths, including:
- Achieving state-of-the-art performance across complex planning benchmarks without the need for external verification tools.
- Unlike approaches like Tree-of-Thought (ToT) that require extensive API requests and computational resources, AoT+ operates efficiently within a single-prompt framework, cutting down on token usage and latency. This improvement is also observed in AoT but token counts in
The authors have done great work and could improve the paper further by addressing the following:
- In terms of presentation, I expected a clearer diagram explaining the different stages of the proposed method. It took me some time to get a better sense of the proposed method by going through the details in the methodology section.
- While AoT+ performs well on the benchmarks reported in the paper, evaluation on real-world planning tasks such as pathfinding for robotics would strengthen the work.
- Wh
- It presents a significant advancement in the autonomous planning capabilities of LLMs, demonstrating their potential to match or exceed human performance.
- This paper proposes a prompting strategy to generate long-horizon plans.
- First, the identification of your performance gap has already been established [1].
- However, several key baselines are missing. Although significant research addresses planning optimization strategies, much of it does not conduct experiments in the Blocksworld domain [2-4]. Furthermore, baseline [5], which even operates in Blocksworld, has not been directly compared.
- Given that your method relies on search-based techniques, it would be beneficial to include comparisons with MCTS-Decoding or
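One of the reviews above credits AoT+ with periodic state regeneration. The following is a minimal sketch of that general idea on a toy Blocksworld, not the authors' prompt format or implementation. The state encoding, the function names (`apply_move`, `describe`, `build_trace`) and the `every` parameter are all assumptions made for illustration: the trace simply restates the full state after every few actions so that later steps do not have to be tracked implicitly.

```python
from typing import Dict, List, Tuple

# Hypothetical toy encoding (an assumption, not taken from the paper):
# each block maps to what it rests on, either another block or the table.
State = Dict[str, str]
Move = Tuple[str, str]  # (block to move, destination)
TABLE = "table"


def is_clear(state: State, name: str) -> bool:
    """A block is clear when no other block rests on it."""
    return name not in state.values()


def apply_move(state: State, move: Move) -> State:
    """Return a new state with the block moved, or raise on an illegal move."""
    block, dest = move
    if block not in state:
        raise ValueError(f"unknown block: {block}")
    if dest != TABLE and dest not in state:
        raise ValueError(f"unknown destination: {dest}")
    if block == dest:
        raise ValueError("a block cannot be placed on itself")
    if not is_clear(state, block):
        raise ValueError(f"{block} is not clear")
    if dest != TABLE and not is_clear(state, dest):
        raise ValueError(f"{dest} is not clear")
    new_state = dict(state)
    new_state[block] = dest
    return new_state


def describe(state: State) -> str:
    """Render the full state as one line, in a stable (sorted) order."""
    return "; ".join(f"{block} on {state[block]}" for block in sorted(state))


def build_trace(initial: State, moves: List[Move], every: int) -> Tuple[List[str], State]:
    """Apply moves in order, restating the full state after every `every` moves."""
    if every < 1:
        raise ValueError("every must be at least 1")
    state = dict(initial)
    trace: List[str] = []
    for step, move in enumerate(moves, start=1):
        state = apply_move(state, move)
        trace.append(f"step {step}: move {move[0]} onto {move[1]}")
        if step % every == 0:
            # Periodic regeneration: write out the whole state explicitly.
            trace.append(f"state: {describe(state)}")
    return trace, state


if __name__ == "__main__":
    start = {"A": TABLE, "B": "A", "C": TABLE}
    plan = [("B", TABLE), ("A", "C"), ("B", "A")]
    lines, final_state = build_trace(start, plan, every=2)
    print("\n".join(lines))
```

In this sketch a smaller `every` spends more text on restating the state, while a larger one leaves more steps to be tracked without an explicit summary. How AoT+ itself balances this is not specified in the text above.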
