Chain of Thoughtlessness? An Analysis of CoT in Planning
Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati

TL;DR
This study critically examines the effectiveness of chain of thought prompting in large language models for classical planning problems, revealing that its benefits are limited to highly specific prompts and do not reflect learning general algorithms.
Contribution
The paper provides a detailed case study showing that chain of thought prompts require extensive problem-specific engineering and do not facilitate general algorithm learning in LLMs.
Findings
Performance gains depend on highly specific prompts
Improvements diminish as query complexity increases
Failure modes are consistent across domains
Abstract
Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting, a method of demonstrating solution procedures, with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those improvements quickly deteriorate as the size n of the query-specified…
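For readers unfamiliar with Blocksworld, the following is a minimal sketch of the kind of problem the abstract refers to: blocks sit on a table or on each other, and a plan is a sequence of moves that must turn an initial arrangement into a goal arrangement. The state encoding, the single `move` action, and the function names (`is_clear`, `apply_move`, `plan_reaches_goal`) are illustrative assumptions, not the paper's formulation or evaluation code. The sketch only shows how a candidate plan, such as one produced by an LLM, could be checked mechanically.

```python
# Hypothetical, simplified Blocksworld. A state maps each block to whatever
# it sits on: another block or the table. This encoding is an assumption
# made for illustration, not the representation used in the paper.
TABLE = "table"


def is_clear(state, block):
    """A block is clear when no other block sits on top of it."""
    return all(under != block for under in state.values())


def apply_move(state, block, dest):
    """Return the state after moving `block` onto `dest`; raise if illegal."""
    if block not in state:
        raise ValueError(f"unknown block {block!r}")
    if not is_clear(state, block):
        raise ValueError(f"{block!r} has something on top of it")
    if dest != TABLE:
        if dest not in state or dest == block or not is_clear(state, dest):
            raise ValueError(f"cannot place {block!r} on {dest!r}")
    new_state = dict(state)  # copy so the caller's state is not mutated
    new_state[block] = dest
    return new_state


def plan_reaches_goal(initial, plan, goal):
    """Check that every move in `plan` is legal and the goal holds at the end."""
    state = initial
    try:
        for block, dest in plan:
            state = apply_move(state, block, dest)
    except ValueError:
        return False
    return all(state[block] == under for block, under in goal.items())


# Three blocks on the table; the goal is the single stack C on B on A.
initial = {"A": TABLE, "B": TABLE, "C": TABLE}
goal = {"B": "A", "C": "B"}

good_plan = [("B", "A"), ("C", "B")]
bad_plan = [("C", "B"), ("B", "A")]  # second move is illegal: B is covered by C

print(plan_reaches_goal(initial, good_plan, goal))
print(plan_reaches_goal(initial, bad_plan, goal))
```

In this sketch the number of blocks is one natural knob for making an instance larger, and the checker is independent of how the plan was produced.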
Taxonomy
Topics: Educational Tools and Methods · Innovative Teaching and Learning Methods
Methods: Hierarchical Information Threading
