ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving
Haoyuan Wu, Xueyi Chen, Rui Ming, Jilong Gao, Shoubo Hu, Zhuolun He, Bei Yu

TL;DR
This paper introduces ToTRL, a reinforcement learning framework that enhances large language models' tree-of-thought reasoning by training them through puzzle-solving tasks, leading to improved reasoning performance and efficiency.
Contribution
We propose a novel on-policy RL method, ToTRL, that trains LLMs to develop tree-of-thought reasoning strategies using puzzle games, advancing systematic and efficient reasoning capabilities.
Findings
ToTRL-trained models outperform baselines on complex reasoning tasks.
Models exhibit improved reasoning efficiency and reduced token usage.
Puzzle-based training effectively cultivates tree-of-thought reasoning in LLMs.
Abstract
Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT…
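The tree-structured exploration described in the abstract (generate several branches, evaluate them, prune the unproductive ones) can be illustrated with a minimal sketch. This is not ToTRL itself: in the paper the LLM is trained to carry out this kind of exploration within its own reasoning, whereas here the branching, scoring, and pruning are hand-written for a toy subset-sum puzzle. Every name below (`NUMBERS`, `TARGET`, `expand`, `score`, `tot_search`) is a hypothetical choice made for illustration.

```python
from typing import List, Optional, Tuple

State = Tuple[int, ...]  # indices of the numbers picked so far (hypothetical encoding)

NUMBERS = (4, 7, 1, 9)   # toy puzzle: pick numbers that sum to TARGET
TARGET = 12


def total(state: State) -> int:
    return sum(NUMBERS[i] for i in state)


def expand(state: State) -> List[State]:
    """Generate child branches, dropping any that already overshoot TARGET."""
    start = state[-1] + 1 if state else 0
    children = [state + (i,) for i in range(start, len(NUMBERS))]
    return [c for c in children if total(c) <= TARGET]


def score(state: State) -> int:
    """Hand-written heuristic (an assumption): closer to TARGET is better."""
    return -(TARGET - total(state))


def is_solution(state: State) -> bool:
    return total(state) == TARGET


def tot_search(beam_width: int, max_depth: int = len(NUMBERS)) -> Optional[State]:
    """Level-by-level tree search that keeps only the best `beam_width` branches."""
    frontier: List[State] = [()]
    for _ in range(max_depth):
        # Branch: expand every surviving state in the frontier.
        candidates = [c for s in frontier for c in expand(s)]
        for c in candidates:
            if is_solution(c):
                return c
        # Evaluate and prune: keep only the highest-scoring branches.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            return None
    return None


if __name__ == "__main__":
    found = tot_search(beam_width=3)
    print([NUMBERS[i] for i in found] if found is not None else None)
```

With `beam_width=3` the search finds the subset 4 + 7 + 1; with `beam_width=2` the branch that leads there is pruned at the first level and the search returns `None`. This shows the trade-off pruning introduces: fewer branches cost less to explore, but an aggressive cut can discard the only productive path.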
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is extremely well written, and details are fleshed out to support reproducibility.
- The experimental results show significant improvements based on a Qwen model that the authors have trained/fine-tuned.
- The authors also demonstrate their approach in a test-time-scaling experiment and show that the learned policy explores the search space more effectively.
- At the beginning of the paper, the authors mention that "Initially, the LLM is trained to perform ToT reasoning in a non-thinking mode, leveraging more moldable thinking patterns to activate ToT reasoning. Once the LLM has developed a degree of ToT reasoning ability in the non-reasoning mode, it undergoes further training in the reasoning mode." This is not revisited later in the paper. Can you show examples of these patterns that activate ToT? Can you show ablation results
The paper tackles an important problem: how to design an effective training procedure for improving parallel-thinking techniques such as ToT.
- The motivation to adapt CoT to ToT reasoning is not well justified. It remains unclear in what sense linear CoT is unsuitable under the ToT setting, and whether the gain from two-stage training is simply due to extended training.
- It is not convincing that applying RL on only two puzzle tasks can improve model performance over a wide range of reasoning tasks. The claim of the title that training on puzzle tasks can unlock the potential of ToT is very broad and needs de
Introducing parallel thinking patterns into reasoning LLMs sounds like a reasonable effort. The ToTQwen3-8B model shows significant performance gains on a variety of logic puzzles.
Since the authors still use the CoT prompt for mathematical problems, it is unclear to me why training improves OOD mathematical tasks. Can you provide an analysis of why it also helps mathematical tasks? I am particularly curious why ToTQwen3-8B can “explore the solution space more effectively and efficiently,” as the authors mention, given that ToT is often very costly. Some details of the experimental setting for Figure 3 in Section 3.5 are unclear. How do you set the budgets as exactly (2^c) k
S1: The paper addresses the well-known inefficiency and verbosity of long CoT reasoning by enabling branching exploration with a global perspective, aligning with prior work in ToT and graph-based reasoning.
S2: The use of a rule-based validator combined with an exact-match reward is straightforward to reproduce for puzzle tasks and eliminates dependence on human-labeled rationales, reflecting trends in O1/R1-style RL frameworks.
S3: Empirical results show that ToTQwen3-8B achieves higher accu
W1: Equation (1) resembles a PPO-style clipped objective with an optional KL term to a reference model. Calling it REINFORCE may obscure the actual optimization method used.
W2: The exact set-equality reward (Eq. 5) is brittle; success may hinge on precise formatting or extraction of answers, which could inflate performance or reduce reproducibility.
W3: Comparisons omit relevant search-based alternatives, including: (i) self-consistency over CoT traces, (ii) explicit ToT BFS/MCTS as in the or
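The brittleness raised in W2 can be made concrete with a minimal sketch of a set-equality reward. This is not the paper's Eq. (5) or its validator; the `Answer:` line, the `;` separator, and the function names `extract_solutions` and `set_equality_reward` are all assumptions made for illustration.

```python
import re
from typing import FrozenSet, Iterable


def extract_solutions(response: str) -> FrozenSet[str]:
    """Pull ';'-separated solutions from a hypothetical 'Answer:' line."""
    match = re.search(r"Answer:\s*(.*)", response)
    if match is None:
        return frozenset()  # no parsable answer at all
    items = match.group(1).split(";")
    return frozenset(item.strip() for item in items if item.strip())


def set_equality_reward(response: str, reference: Iterable[str]) -> float:
    """Return 1.0 only if the extracted set equals the reference set exactly."""
    return 1.0 if extract_solutions(response) == frozenset(reference) else 0.0


if __name__ == "__main__":
    reference = ["1,2,3", "3,2,1"]
    # Order of solutions does not matter, so this earns the reward.
    print(set_equality_reward("Reasoning...\nAnswer: 3,2,1; 1,2,3", reference))
    # Same solutions, but extra spaces inside each item break the string match.
    print(set_equality_reward("Answer: 1, 2, 3; 3, 2, 1", reference))
```

In this sketch the second response lists the same solutions yet receives no reward, because the comparison is on raw strings. A validator that normalizes each solution before comparing would avoid that failure mode; whether the paper's validator does so is not stated in the excerpt above.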
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · Reinforcement Learning in Robotics
Methods: Pruning
