LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Di Zhang; Jianbo Wu; Jingdi Lei; Tong Che; Jiatong Li; Tong Xie; Xiaoshui Huang; Shufei Zhang; Marco Pavone; Yuqiang Li; Wanli Ouyang; Dongzhan Zhou

arXiv:2410.02884·cs.AI·November 22, 2024·3 cites

Open Access · 1 Repo · 6 Models · 5 Datasets

TL;DR

LLaMA-Berry introduces a novel framework combining Monte Carlo Tree Search, Self-Refine, and pairwise reward modeling to significantly improve mathematical reasoning in Large Language Models, especially for complex Olympiad problems.

Contribution

The paper presents a new optimization framework that integrates pairwise reward models with MCTS and Self-Refine, enhancing reasoning efficiency and accuracy in LLMs for advanced mathematical tasks.

Findings

1. Outperforms existing methods like ToT and rStar on Olympiad benchmarks.
2. Achieves higher problem-solving accuracy and search efficiency.
3. Effective in complex and diverse mathematical reasoning tasks.

Abstract

This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. The Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to…
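The step from pairwise preferences to a single global ranking can be illustrated with a small sketch. The abstract as shown here is truncated and does not specify what the Enhanced Borda Count adds, so the code below implements only a plain Borda-style count and should not be read as the paper's method. The names `borda_scores` and `toy_prefer` are hypothetical, and `toy_prefer` is a toy stand-in for a learned pairwise reward model such as the PPRM.

```python
from typing import Callable, List, Sequence


def borda_scores(solutions: Sequence[str],
                 prefer: Callable[[str, str], float]) -> List[float]:
    """Turn pairwise preferences into one global score per solution.

    prefer(a, b) is a hypothetical stand-in for a pairwise reward model:
    it returns the estimated probability that solution a beats solution b.
    """
    n = len(solutions)
    if n < 2:
        return [0.0] * n
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            # Count a win for i whenever the model prefers i over j.
            if i != j and prefer(solutions[i], solutions[j]) > 0.5:
                wins[i] += 1
    # Normalize by the number of opponents so scores lie in [0, 1].
    return [w / (n - 1) for w in wins]


def toy_prefer(a: str, b: str) -> float:
    # Toy assumption for illustration only: longer answers are preferred.
    return 1.0 if len(a) > len(b) else 0.0


candidates = ["a", "abc", "ab"]
scores = borda_scores(candidates, toy_prefer)
best = candidates[scores.index(max(scores))]
```

Each candidate's score is the fraction of other candidates it beats, so a solution preferred over every alternative scores 1.0. In a search setting, such scores could serve as the global signal used to compare reasoning paths, though how the paper combines them with MCTS is not stated in this excerpt.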

Code & Models

Repositories

trotsky1997/mathblackbox

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning