Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning
Huchen Jiang, Yangyang Ma, Chaofan Ding, Kexin Luan, Xinhan Di

TL;DR
This paper proposes an intrinsic self-correction method for Large Language Models to improve step-wise reasoning by leveraging iterative preference learning and self-generated data, resulting in significant accuracy improvements on arithmetic tasks.
Contribution
It introduces a two-stage training process that enhances LLM reasoning through intrinsic self-correction and preference learning, advancing beyond existing iterative methods.
Findings
Outperforms baseline models on MATH and GSM8K datasets.
Achieves up to 4.94% accuracy improvement.
Demonstrates effectiveness of self-correction in reasoning tasks.
Abstract
Current state-of-the-art approaches enhance the reasoning capabilities of Large Language Models (LLMs) through iterative preference learning inspired by AlphaZero. We propose to further enhance step-wise reasoning capabilities through intrinsic self-correction. Our work leverages step-wise preference learning to enhance self-verification via reinforcement learning. We conduct our work through a two-stage training procedure. In the first stage, the self-correction reasoning ability of an LLM is enhanced through its own predictions, relying entirely on self-generated data for intrinsic self-correction. In the second stage, the baseline step-wise preference learning is applied using the enhanced self-correction policy obtained in the first stage. In the evaluation of arithmetic reasoning tasks, our approach…
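The abstract does not state which preference objective is used. Purely as an illustration of what a step-wise preference update could look like, the sketch below computes a DPO-style logistic loss for a single pair of candidate reasoning steps. The choice of a DPO-style loss, the function name `step_preference_loss`, and the parameter `beta` are assumptions for this example and are not taken from the paper.

```python
import math


def step_preference_loss(logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected,
                         beta=0.1):
    """DPO-style loss for one (chosen, rejected) pair of reasoning steps.

    Hypothetical illustration: inputs are log-probabilities of each step
    under the policy being trained and under a frozen reference policy.
    """
    # How much more the policy prefers the chosen step than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls below log(2) once the policy favors the chosen step
# more strongly than the reference policy does.
neutral = step_preference_loss(-1.0, -1.0, -1.0, -1.0)
improved = step_preference_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```

In a two-stage setup like the one described, a loss of this kind could in principle be applied to step pairs in both stages; how the pairs are actually built and scored is left to the paper itself.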
Taxonomy
Topics: Time Series Analysis and Forecasting · Data Management and Algorithms · Data Mining Algorithms and Applications
Methods: AlphaZero
