ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving

Haoyuan Wu; Xueyi Chen; Rui Ming; Jilong Gao; Shoubo Hu; Zhuolun He; Bei Yu

arXiv:2505.12717·cs.CL·December 29, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces ToTRL, a reinforcement learning framework that enhances large language models' tree-of-thought reasoning by training them through puzzle-solving tasks, leading to improved reasoning performance and efficiency.

Contribution

We propose a novel on-policy RL method, ToTRL, that trains LLMs to develop tree-of-thought reasoning strategies using puzzle games, advancing systematic and efficient reasoning capabilities.

Findings

01. ToTRL-trained models outperform baselines on complex reasoning tasks.

02. Models exhibit improved reasoning efficiency and reduced token usage.

03. Puzzle-based training effectively cultivates tree-of-thought reasoning in LLMs.

Abstract

Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT…
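The tree-structured exploration described in the abstract can be illustrated with a minimal sketch. This is not the ToTRL training method or the authors' code: it is a plain breadth-first search over a toy 4-queens puzzle, and the names `is_consistent` and `tree_search` are hypothetical. It only shows the general idea of expanding several branches per step, evaluating each partial branch, and pruning those that cannot lead to a solution.

```python
from typing import List, Tuple

# A state is the column chosen for the queen in each row placed so far.
State = Tuple[int, ...]


def is_consistent(state: State) -> bool:
    """Evaluate a partial branch: no two queens share a column or a diagonal."""
    for r1 in range(len(state)):
        for r2 in range(r1 + 1, len(state)):
            same_column = state[r1] == state[r2]
            same_diagonal = abs(state[r1] - state[r2]) == r2 - r1
            if same_column or same_diagonal:
                return False
    return True


def tree_search(n: int) -> Tuple[List[State], int]:
    """Breadth-first exploration of the placement tree.

    Returns the complete solutions and the number of branches pruned.
    """
    frontier: List[State] = [()]
    pruned = 0
    for _ in range(n):
        next_frontier: List[State] = []
        for state in frontier:
            # Generate all candidate branches for the next row.
            for col in range(n):
                child = state + (col,)
                if is_consistent(child):
                    next_frontier.append(child)
                else:
                    pruned += 1  # unproductive path, dropped immediately
        frontier = next_frontier
    return frontier, pruned


solutions, pruned = tree_search(4)
print(solutions)
print("branches pruned:", pruned)
```

In this toy setting the evaluator is an exact rule. In an LLM setting the branch generation and evaluation would be carried out by the model itself, which is where the potential token savings mentioned in the abstract would come from.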

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The paper is extremely well written, and details are fleshed out to support reproducibility.
- The experimental results show significant improvements based on a Qwen model that the authors have trained/fine-tuned.
- The authors also demonstrate their approach in a test-time-scaling experiment and show that the learned policy explores the search space more effectively.

Weaknesses

- At the beginning of the paper, the authors mention that "Initially, the LLM is trained to perform ToT reasoning in a non-thinking mode, leveraging more moldable thinking patterns to activate ToT reasoning. Once the LLM has developed a degree of ToT reasoning ability in the non-reasoning mode, it undergoes further training in the reasoning mode." This is not referred back to later in the paper. Can you show/demonstrate examples of these patterns that activate ToT? Can you show ablation results

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper tackles the important problem of how to design an effective training procedure for improving parallel thinking techniques such as ToT.

Weaknesses

- The motivation to adapt CoT to ToT reasoning is not well justified. It remains unclear in what sense linear CoT is unsuitable under the ToT setting, and whether the gain from two-stage training is simply due to extended training.
- It is not convincing that applying RL on only two puzzle tasks can improve model performance over a wide range of reasoning tasks. The claim of the title that training on puzzle tasks can unlock the potential of ToT is very broad and needs de

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Introducing parallel thinking patterns into reasoning LLMs sounds like a reasonable effort. The ToTQwen3-8B model shows significant performance gains on a variety of logic puzzles.

Weaknesses

Since the authors still use the CoT prompt for mathematical problems, it is unclear to me why the method improves OOD mathematical tasks. Can you provide analysis as to why it also helps mathematical tasks? I am particularly curious why ToTQwen3-8B can “explore the solution space more effectively and efficiently” as the authors mentioned, given that ToT is often very costly. Some details of the experimental setup for Figure 3 in Section 3.5 are unclear. How do you set the budgets as exactly (2^c) k

Reviewer 04 · Rating 4 · Confidence 3

Strengths

S1: The paper addresses the well-known inefficiency and verbosity of long CoT reasoning by enabling branching exploration with a global perspective, aligning with prior work in ToT and graph-based reasoning.
S2: The use of a rule-based validator combined with an exact-match reward is straightforward to reproduce for puzzle tasks and eliminates dependence on human-labeled rationales, reflecting trends in O1/R1-style RL frameworks.
S3: Empirical results show that ToTQwen3-8B achieves higher accu

Weaknesses

W1: Equation (1) resembles a PPO-style clipped objective with an optional KL term to a reference model. Calling it REINFORCE may obscure the actual optimization method used.
W2: The exact set-equality reward (Eq. 5) is brittle; success may hinge on precise formatting or extraction of answers, which could inflate performance or reduce reproducibility.
W3: Comparisons omit relevant search-based alternatives, including: (i) self-consistency over CoT traces, (ii) explicit ToT BFS/MCTS as in the or
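The brittleness concern in W2 can be made concrete with a small sketch. The paper's Eq. 5 is not reproduced on this page, so everything below is an assumption for illustration: the extractor `extract_answers`, the `\boxed{...}` answer format, and the comma separator are hypothetical choices, not the authors' implementation. The sketch only shows how a binary set-equality reward depends on the extraction step.

```python
import re
from typing import Set


def extract_answers(text: str) -> Set[str]:
    """Hypothetical extractor: comma-separated items in the last \\boxed{...}."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if not matches:
        return set()
    return {item.strip() for item in matches[-1].split(",") if item.strip()}


def set_equality_reward(response: str, reference: Set[str]) -> float:
    """Return 1.0 only if the extracted answer set equals the reference set."""
    return 1.0 if extract_answers(response) == reference else 0.0


reference = {"1302", "2031"}

# Same answers, different order: set comparison ignores ordering.
print(set_equality_reward(r"So the answer is \boxed{2031, 1302}", reference))

# Same answers, but a semicolon separator defeats this extractor.
print(set_equality_reward(r"So the answer is \boxed{2031; 1302}", reference))

# One of two solutions missing: no partial credit.
print(set_equality_reward(r"So the answer is \boxed{1302}", reference))
```

Under these assumptions, the second response is semantically correct but receives zero reward purely because of formatting, which is the kind of sensitivity the reviewer describes.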

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · Reinforcement Learning in Robotics

Methods: Pruning