Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, Michael Shieh

TL;DR
This paper improves the reasoning of large language models by using Monte Carlo Tree Search (MCTS) to collect step-level preference data and Direct Preference Optimization (DPO) to iteratively update the policy on that data, yielding accuracy gains on arithmetic and commonsense reasoning benchmarks.
Contribution
The paper introduces an AlphaZero-inspired iterative preference learning framework that uses MCTS and DPO to improve LLM reasoning, supported by theoretical analysis and empirical evaluation.
Findings
Outperforms baseline models on reasoning benchmarks
Achieves up to 15.8% accuracy improvement on ARC-C
Provides insights into compute tradeoffs for training and inference
Abstract
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the importance of using on-policy sampled data for successful self-improving. Extensive evaluations on various arithmetic and commonsense…
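As a rough illustration of the update described in the abstract, the sketch below picks a preferred and a dispreferred reasoning step from a set of candidates scored by a tree search, then evaluates the standard DPO loss on that pair. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `pick_step_pair` and `dpo_loss`, the choice of the highest- and lowest-valued candidates as the pair, and the default `beta=0.1` are all hypothetical and chosen only for illustration.

```python
import math


def pick_step_pair(candidates):
    """Return (preferred_step, dispreferred_step) from scored candidates.

    `candidates` is a list of (step_text, value_estimate) tuples, where the
    value estimate is assumed to come from a tree search such as MCTS.
    Taking the highest- and lowest-valued steps is an illustrative choice.
    """
    ranked = sorted(candidates, key=lambda c: c[1])
    return ranked[-1][0], ranked[0][0]


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are log-probabilities of the chosen and rejected steps under the
    current policy and under a frozen reference model. `beta` scales the
    implicit reward; 0.1 is an assumed default, not a value from the paper.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical candidate steps with made-up value estimates.
    steps = [("add 3 and 4", 0.9), ("multiply 3 and 4", -0.5), ("guess", 0.2)]
    chosen, rejected = pick_step_pair(steps)
    print(chosen, "|", rejected)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the preferred step's likelihood relative to the reference.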
Taxonomy
Topics: Time Series Analysis and Forecasting · Data Management and Algorithms · Data Mining Algorithms and Applications
Methods: AlphaZero
