ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang

TL;DR
ReST-MCTS* introduces a reinforcement learning approach combining process reward guidance with tree search to generate high-quality reasoning traces, improving LLM self-training and reasoning accuracy without manual annotation.
Contribution
The paper presents ReST-MCTS*, a novel method that infers process rewards through tree search, enhancing self-training of LLMs with higher-quality reasoning traces and outperforming prior methods.
Findings
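The tree-search policy in ReST-MCTS* achieves higher accuracy than prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought under the same search budget.
Training on traces found by this tree-search policy continues to improve the models over multiple self-training iterations.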
ReST-MCTS* outperforms ReST$^\text{EM}$ and Self-Rewarding LM.
Abstract
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value…
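The idea of inferring per-step rewards from oracle final answers can be illustrated with a small sketch. This is not the paper's value definition or its MCTS* search. It is a plain Monte Carlo estimate under simplifying assumptions: traces are already sampled, each trace is a list of step strings plus a final answer, and the value of a partial solution is the fraction of traces passing through it that reach the oracle answer. The function name `estimate_step_values` and the toy traces are hypothetical.

```python
from collections import defaultdict


def estimate_step_values(traces, oracle_answer):
    """Estimate a value for every reasoning prefix seen in `traces`.

    traces: list of (steps, final_answer) pairs, where steps is a list of str.
    Returns a dict mapping each prefix (a tuple of steps) to the fraction of
    traces sharing that prefix whose final answer equals `oracle_answer`.
    """
    visits = defaultdict(int)  # number of traces that pass through a prefix
    wins = defaultdict(int)    # number of those traces that end correctly
    for steps, final_answer in traces:
        correct = final_answer == oracle_answer
        for depth in range(1, len(steps) + 1):
            prefix = tuple(steps[:depth])
            visits[prefix] += 1
            wins[prefix] += int(correct)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}


# Toy traces (hypothetical): only the final answers are checked against the
# oracle; no step-level labels are provided.
traces = [
    (["x = 2 + 3", "x = 5", "answer 10"], "10"),
    (["x = 2 + 3", "x = 5", "answer 25"], "25"),
    (["x = 2 + 3", "x = 6", "answer 12"], "12"),
    (["x = 2 * 3", "x = 6", "answer 12"], "12"),
]
values = estimate_step_values(traces, oracle_answer="10")
```

In this toy setting a prefix that more often leads to the oracle answer receives a higher estimate, which is the sense in which step-level signals can be derived without manual per-step annotation. Such estimates could then serve as training targets for a reward model or as a filter for selecting traces, as the abstract describes at a high level.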
Taxonomy
Topics: Business Process Modeling and Analysis
Methods: Sparse Evolutionary Training
