ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

Dan Zhang; Sining Zhoubian; Ziniu Hu; Yisong Yue; Yuxiao Dong; Jie Tang

arXiv:2406.03816·cs.CL·November 19, 2024·6 cites

TL;DR

ReST-MCTS* introduces a reinforcement learning approach combining process reward guidance with tree search to generate high-quality reasoning traces, improving LLM self-training and reasoning accuracy without manual annotation.

Contribution

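The paper presents ReST-MCTS*, a method that infers per-step process rewards through tree search given only oracle final answers. The resulting higher-quality reasoning traces improve LLM self-training, and the method outperforms prior self-training approaches such as ReST$^\text{EM}$ and Self-Rewarding LM.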

Findings

01

Tree-search policy achieves higher accuracy than prior baselines.

02

Using searched traces improves models over multiple iterations.

03

ReST-MCTS* outperforms ReST$^\text{EM}$ and Self-Rewarding LM.

Abstract

Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value…
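The core idea in the abstract, inferring a per-step reward from oracle final answers alone, can be illustrated with a minimal sketch. It estimates the value of each reasoning step as the fraction of completions from that step's prefix that reach the known correct answer. This is a plain Monte Carlo rollout estimate, not the paper's MCTS* procedure, and every name here (`estimate_step_values`, `toy_rollout`, the example traces) is hypothetical. The rollout function is a toy stand-in for an LLM policy.

```python
import random
from typing import Callable, List


def estimate_step_values(
    steps: List[str],
    rollout: Callable[[List[str], random.Random], str],
    oracle_answer: str,
    n_rollouts: int = 100,
    seed: int = 0,
) -> List[float]:
    """Estimate, for each prefix of `steps`, the probability that
    completing the solution from that prefix yields `oracle_answer`."""
    rng = random.Random(seed)
    values = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        # Count how many sampled completions end in the oracle answer.
        hits = sum(
            rollout(prefix, rng) == oracle_answer for _ in range(n_rollouts)
        )
        values.append(hits / n_rollouts)
    return values


def toy_rollout(prefix: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM finishing a partial solution.

    Assumption for illustration only: a prefix containing a faulty step
    never recovers, and otherwise the chance of success grows with the
    number of correct steps already taken.
    """
    if any("wrong" in step for step in prefix):
        return "0"
    p_success = min(1.0, 0.3 * len(prefix))
    return "42" if rng.random() < p_success else "0"


good_trace = ["set up equation", "isolate x", "simplify", "compute x = 42"]
bad_trace = ["set up equation", "wrong sign flip", "simplify", "compute x = 41"]

good_values = estimate_step_values(good_trace, toy_rollout, oracle_answer="42")
bad_values = estimate_step_values(bad_trace, toy_rollout, oracle_answer="42")

print("good trace step values:", good_values)
print("bad trace step values: ", bad_values)
```

In this toy setting the estimated value drops to zero at the faulty step and stays there, which is the kind of per-step signal that final-answer filtering alone cannot provide. No step was labeled by hand: the only supervision is the oracle final answer.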

Code & Models

Datasets

rawsh/magpie-ultra-v0.1-PRM-data-base
dataset · 9 downloads

Taxonomy

Topics: Business Process Modeling and Analysis

Methods: Sparse Evolutionary Training