RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation

Qingyao Li; Wei Xia; Kounianhua Du; Xinyi Dai; Ruiming Tang; Yasheng Wang; Yong Yu; Weinan Zhang

arXiv:2409.09584 · cs.SE · October 27, 2025

TL;DR

RethinkMCTS introduces a novel framework combining Monte Carlo Tree Search with a refinement process that uses execution feedback to improve reasoning and code generation quality.

Contribution

The paper presents RethinkMCTS, a new method that systematically refines reasoning in code generation by integrating MCTS with a feedback-driven refinement mechanism.

Findings

1. Outperforms previous search-based code generation methods
2. Improves reasoning quality through feedback-driven refinement
3. Enhances search efficiency and accuracy in code generation

Abstract

Tree search methods have demonstrated impressive performance in code generation. Previous methods combine tree search with reflection that summarizes past mistakes to achieve iterative improvement. However, these methods face significant challenges. First, they search directly within the code language space, neglecting the underlying reasoning process critical for effective code generation. Second, reflection-based approaches merely accumulate historical errors in memory without providing correct reasoning pathways, making it difficult for subsequent search iterations to identify optimal solutions, resulting in decreased search quality. In this work, we propose RethinkMCTS, a framework that systematically explores and refines the reasoning process for code generation. Specifically, we employ MCTS to search for thoughts before code generation and integrate MCTS with a refinement…
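The idea in the abstract, searching over thoughts with MCTS and refining an erroneous thought rather than only recording the error, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated and does not specify the algorithm's details. The names `ThoughtNode`, `search`, `propose`, `evaluate`, and `rethink` are hypothetical, the three `toy_*` functions are deterministic stand-ins for LLM calls and test execution, and standard UCT selection is assumed.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class ThoughtNode:
    """One natural-language reasoning step in the search tree (hypothetical)."""
    thought: str
    parent: Optional["ThoughtNode"] = field(default=None, repr=False)
    children: List["ThoughtNode"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def path(self) -> List[str]:
        """Thoughts from just below the root down to this node."""
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.thought)
            node = node.parent
        return steps[::-1]


def uct(child: ThoughtNode, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(propose, evaluate, rethink, iterations=10, max_depth=2):
    root = ThoughtNode("<root>")
    best_score, best_path = -1.0, []
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a node without children.
        node, depth = root, 0
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
            depth += 1
        # Expansion: ask for candidate next thoughts (an LLM call in practice).
        if depth < max_depth:
            for thought in propose(node.path()):
                node.children.append(ThoughtNode(thought, parent=node))
            if node.children:
                node = node.children[0]
        # Evaluation: score the thought path, e.g. by running public tests.
        score, feedback = evaluate(node.path())
        # Rethink (assumption): rewrite the flagged thought in place using
        # the feedback, then re-score, instead of only logging the error.
        if feedback and node.parent is not None:
            node.thought = rethink(node.path(), feedback)
            score, feedback = evaluate(node.path())
        if score > best_score:
            best_score, best_path = score, node.path()
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += score
            node = node.parent
        if best_score == 1.0:
            break
    return best_score, best_path


# Toy stand-ins (hypothetical): the task is "return the maximum of a list".
TARGET = ["sort the list", "return the last element"]
FIXES = {"scan without sorting": "sort the list",
         "return the first element": "return the last element"}


def toy_propose(path: List[str]) -> List[str]:
    if not path:
        return ["sort the list", "scan without sorting"]
    return ["return the first element", "return the last element"]


def toy_evaluate(path: List[str]) -> Tuple[float, str]:
    hits = sum(1 for got, want in zip(path, TARGET) if got == want)
    wrong = bool(path) and path[-1] not in TARGET
    return hits / len(TARGET), ("last thought fails the tests" if wrong else "")


def toy_rethink(path: List[str], feedback: str) -> str:
    return FIXES.get(path[-1], path[-1])
```

Calling `search(toy_propose, toy_evaluate, toy_rethink)` with these stubs reaches a score of 1.0 once the flawed thought "return the first element" is rewritten in place. In a real system the three callables would wrap an LLM and a sandboxed test runner.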

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

- The problem, search in LLM code generation, is very relevant.
- The ablation in Figure 3 shows that each component plays a certain role in the framework.

Weaknesses

- The evaluation strategy and the refinement in tree search appear to be stitched together for final performance but are totally unrelated strategies. As the paper is titled RethinkMCTS, adding additional feedback effort may make it more difficult for the audience to evaluate the role of "rethink".
- For example, RethinkMCTS without VF (39 in Figure 3) can be lower than PG-TD (40 in Table 1) for APPS Intro. It is not clear what the performance of rethink only is, in comparison with the baselines,

Reviewer 02 · Rating 3 · Confidence 3

Strengths

1. This paper is well-written and easy to understand. The various feedback and the rethink process are integrated quite well.
2. The included baselines are comprehensive.

Weaknesses

1. The novelty of this paper seems limited. I am struggling to understand the difference from the LATS and ToT work. In comparison, it seems like the proposed method is essentially LATS + ToT? If that is correct, I think the contribution is of limited significance.
2. The improvements over baselines, especially ToT, are small. With GPT-4o-mini, the differences were quite small. As there are no confidence intervals, there is doubt about the statistical significance of those improvements.
3. It

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The paper proposes a neat, conceptually simple way of doing MCTS in the space of LLMs for code generation.
- This work suggests an alternative to current approaches to handle incomplete coverage of solutions by public test cases by proposing LLM self-assessment instead of generating synthetic test cases that may not always be accurate.
- The method section is written well, clearly outlining the different parts of the MCTS algorithm and what they correspond to in this context, as well as the ad

Weaknesses

- In Table 1, the RethinkMCTS results have been marked in bold, which I presume indicates the superior performance of this method over other baselines. However, for GPT-4o-mini, the results shown by the Tree-of-Thought baseline are, in most cases, at par with RethinkMCTS, and the pass rate of PG-TD on APPS-Comp is actually higher than that of RethinkMCTS. This table would be clearer to read if the best baselines that are at par with or better than RethinkMCTS were als

Videos

RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation · underline

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Softmax · Layer Normalization · Dropout · Residual Connection · Attention Dropout · Linear Layer