Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

Yuxi Xie; Anirudh Goyal; Wenyue Zheng; Min-Yen Kan; Timothy P. Lillicrap; Kenji Kawaguchi; Michael Shieh

arXiv:2405.00451 · cs.AI · June 19, 2024 · 3 cites

TL;DR

This paper improves the reasoning of large language models by using Monte Carlo Tree Search (MCTS) to collect step-level preference data and Direct Preference Optimization (DPO) to update the model iteratively, with reported accuracy gains of up to 15.8% on ARC-C.

Contribution

It introduces an iterative preference learning framework using MCTS and DPO to improve LLM reasoning, inspired by AlphaZero's success, with theoretical and empirical validation.

Findings

1. Outperforms baseline models on reasoning benchmarks
2. Achieves up to 15.8% accuracy improvement on ARC-C
3. Provides insights into compute tradeoffs for training and inference

Abstract

We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the importance of using on-policy sampled data for successful self-improving. Extensive evaluations on various arithmetic and commonsense…
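The DPO update on a single step-level preference pair, as described in the abstract, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the standard DPO loss is used as-is, while the helper `pick_step_pair`, the choice of highest-valued versus lowest-valued candidate as the preferred and dispreferred step, the example steps, and the value `beta=0.1` are all assumptions made for the example.

```python
import math


def pick_step_pair(candidates):
    """Pick a (preferred, dispreferred) pair from candidate next steps.

    `candidates` is a list of (step_text, value) tuples, where `value` stands in
    for a search-derived quality estimate. Hypothetical helper: taking the
    highest- and lowest-valued candidates is an assumption for illustration.
    """
    best = max(candidates, key=lambda c: c[1])
    worst = min(candidates, key=lambda c: c[1])
    return best[0], worst[0]


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                               - (log pi(y_l) - log pi_ref(y_l))]))
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative candidates for one reasoning step (made-up text and values).
steps = [("add 3 and 4 to get 7", 0.9),
         ("add 3 and 4 to get 8", 0.1),
         ("restate the question", 0.5)]
chosen, rejected = pick_step_pair(steps)

# Made-up log-probabilities: the policy already prefers the chosen step
# slightly more than the reference model does.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities to both steps, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the preferred step relative to the reference.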

Taxonomy

Topics: Time Series Analysis and Forecasting · Data Management and Algorithms · Data Mining Algorithms and Applications

Methods: AlphaZero