Automating Thought of Search: A Journey Towards Soundness and Completeness

Daniel Cao; Michael Katz; Harsha Kokel; Kavitha Srinivas; Shirin Sohrabi

arXiv:2408.11326·cs.AI·May 29, 2025


Open Access · 5 Reviews

TL;DR

This paper introduces AutoToS, an automated approach that guides large language models to generate sound and complete search components for planning problems, removing the human from the loop of interactions with the language model and reaching 100% accuracy on the tested domains.

Contribution

It presents AutoToS, a novel method that automates the Thought of Search process, enabling LLMs to produce sound and complete search components with minimal human input.

Findings

01. AutoToS achieves 100% accuracy across tested domains.

02. AutoToS requires only a small number of LLM calls.

03. The method effectively guides LLMs using feedback from unit tests.
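To make the idea of unit-test feedback concrete, here is a minimal sketch of how a failed check on a generated successor function could be turned into a feedback message for the model. Everything in it is an illustrative assumption rather than the paper's actual test suite or prompts: the toy domain (a position that may advance by 1 or 2), the names `buggy_succ`, `fixed_succ`, and `check_successor`, and the wording of the messages are all hypothetical.

```python
import copy


def buggy_succ(state):
    """Hypothetical generated successor function with two bugs."""
    state["pos"] += 1   # bug 1: mutates the input state in place
    return [state]      # bug 2: omits the legal +2 move


def fixed_succ(state):
    """Hypothetical corrected version: returns fresh states for both moves."""
    return [{"pos": state["pos"] + 1}, {"pos": state["pos"] + 2}]


def check_successor(succ, example_state, expected_successors):
    """Run simple checks and return feedback strings (empty list = all passed)."""
    arg = copy.deepcopy(example_state)  # protect the caller's state
    try:
        produced = succ(arg)
    except Exception as exc:
        return [f"Successor function raised {type(exc).__name__}: {exc}"]

    feedback = []
    # Generic check: the input state must not be modified.
    if arg != example_state:
        feedback.append("The successor function modified its input state.")
    # Completeness check: every expected successor must be produced.
    missing = [s for s in expected_successors if s not in produced]
    if missing:
        feedback.append(f"Missing successors: {missing}")
    # Soundness check: nothing outside the expected set may be produced.
    extra = [s for s in produced if s not in expected_successors]
    if extra:
        feedback.append(f"Unexpected successors: {extra}")
    return feedback


expected = [{"pos": 1}, {"pos": 2}]
for message in check_successor(buggy_succ, {"pos": 0}, expected):
    print(message)
```

In a loop of this kind, a non-empty feedback list would be sent back to the model with a request for a revised function, and the loop would stop once the list comes back empty.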

Abstract

Large language models (LLMs) are being used to solve planning problems that require search. Most of the literature uses LLMs as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having LLMs produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. Consequently, there is great potential to automate the ToS process. We take a first major step towards automating ToS (AutoToS), taking the human out of the loop of interactions with the language model. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain…
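As a rough illustration of what "defining the search space with code" means, the sketch below gives a successor function and a goal test for the 24 Game (combine four numbers with +, -, *, / to reach 24), plugged into a plain breadth-first search. It is hand-written for this page under stated assumptions and is not code from the paper: the state representation (a sorted tuple of exact fractions) and the `solve` helper are choices made here, while the names `succ` and `isgoal` follow the reviewers' wording.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

TARGET = 24


def isgoal(state):
    """Goal test: exactly one number remains and it equals the target."""
    return len(state) == 1 and state[0] == TARGET


def succ(state):
    """Successor function: replace any two numbers by the result of one operation.

    Sound: only legal arithmetic moves are generated.
    Complete: every legal move (both operand orders) is generated.
    """
    successors = set()
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        results = {a + b, a * b, a - b, b - a}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for r in results:
            successors.add(tuple(sorted(rest + [r])))
    return successors


def solve(numbers):
    """Breadth-first search over states; True if the target is reachable."""
    start = tuple(sorted(Fraction(n) for n in numbers))
    frontier = deque([start])
    seen = {start}
    while frontier:
        state = frontier.popleft()
        if isgoal(state):
            return True
        for nxt in succ(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


print(solve([1, 3, 4, 6]))  # True, via 6 / (1 - 3/4)
print(solve([1, 1, 1, 1]))  # False, the largest reachable value is 4
```

Using `Fraction` instead of floats keeps division exact, so the goal test never fails because of rounding. Once the two functions are correct, the search algorithm itself needs no model calls, which is why the correctness of `succ` and `isgoal` carries the whole result.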

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper clearly motivates the need to automate the iterative feedback and exception handling process within the LLM-driven ToS paradigm, effectively shifting the burden from continuous human involvement to the specification of unit tests by humans.
- Evaluation is conducted across multiple domains of varying complexity, including classic (i.e. with BlocksWorld and Sokoban), logic (i.e. with PrOntoQA), and mini-crosswords, as well as with various LLMs of different sizes.
- AutoToS often achie

Weaknesses

- Even though much of the process is automated, the framework assumes either existing or easily generated unit tests. Consequently, human involvement is not eliminated but rather shifted: from interacting directly with the LLM to designing or generating appropriate unit tests.
- The authors use examples from the 24 Game without providing sufficient explanations of the game itself. Including a brief description of the game in the Background section, ideally expanding the existing explanation on l

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The main contribution, according to the authors, is that they can leverage all the existing approaches in software engineering and LLM-based code generation to create an automated feedback mechanism, thus avoiding the need for human intervention. (But it looks like the human still has to generate some unit tests, unless you are using formal specs of the environment?)
- They show that this approach leads to 100% accuracy (just like the manually intervened ToS models).
- The error analysis is good sh

Weaknesses

- While I like the paper's approach, this seems like a very minor contribution and primarily relies on test-based checks to achieve full 100% automation. The choice of problems might have been restricted to those where this is possible. I am not sure how this can be used as a general reasoning mechanism for LLM applications.
- While I commended the error analysis above, this also feels like an opportunity to use this to see what needs to be done so you can get uniform performance across models an

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper is generally written clearly. This direction of reducing LLM dependence on human feedback for complex reasoning has potential.

Weaknesses

I don’t see how this method can be really useful in more practical scenarios. Since the experiment setting is very toy, if you only focus on tasks with available `isgoal` and `succ` functions, you can actually replace the LLM call with very simple human-written functions, even simpler than the prompts you write. Besides, I don't believe the 100% accuracy in the experiment section is meaningful. It is trivial to write a simple search method to achieve 100% accuracy on all of them, as long as y

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- The paper is easy to follow, and the contributions of the paper in the context of existing works are clearly communicated.
- The paper presents an important step towards automation of ToS enabling automated planning without requiring humans in the loop.
- The paper’s focus on the necessity of soundness and completeness in the context of planning (which are often overshadowed in works that tend to use LLMs directly as planners) is well-appreciated, as they are key to achieving good and reliable

Weaknesses

- The paper is incremental, as it takes one specific step of the prior approach and automates that step, replacing the human-in-the-loop mechanism of the prior approach with an LLM-in-the-loop mechanism. While this is an important extension, this limits the novelty and impact of the paper for the broader research community.
- The claim about 100% accuracy feels underwhelming since it is based on success in any of the 5 trials (using an external validator). It would be interesting to see what average

Reviewer 05 · Rating 4 · Confidence 4

Strengths

1. It takes the human out of the loop for evaluation, which makes it more useful for practical applications.
2. It allows for the use of smaller models, since it splits the planning problem into steps.
3. It is straightforward to implement and to replicate.
4. Good evaluation with a number of models and a good number of problems.

Weaknesses

1. It is expensive, as each step needs an LLM call, and this also comes with some latency.
2. The paper claims 100% accuracy on tested datasets. This is great, but it comes with the downside that it does not reveal the limits of the method. Benchmarks that leave some headroom would be preferable. For BlocksWorld in particular, this would be possible by using more blocks or problems that need more planning steps to solve.
3. The paper runs a best-of-5 as they have a validator so this seems to make it more expenstiv

Taxonomy

Topics: Artificial Intelligence in Games

Methods: Softmax · Attention Is All You Need