Step-level Value Preference Optimization for Mathematical Reasoning

Guoxin Chen; Minpeng Liao; Chengxi Li; Kai Fan

arXiv:2406.10858·cs.CL·September 30, 2024

TL;DR

This paper introduces SVPO, a method that improves mathematical reasoning in large language models by using Monte Carlo Tree Search to annotate step-level preferences, so that preference optimization reflects the quality of individual reasoning steps rather than only whole responses.

Contribution

The paper presents SVPO, combining step-level preference annotation with a learned value model to improve multi-step reasoning in LLMs, surpassing previous methods.

Findings

01. Achieves state-of-the-art results on mathematical reasoning benchmarks.

02. Effectively captures fine-grained step-level preferences.

03. Reduces inference costs with the explicit value model.

Abstract

Direct Preference Optimization (DPO), which uses an implicit reward model, has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference-aligned large language models (LLMs). However, preference annotations over whole responses do not fully capture the fine-grained quality of model outputs in complex multi-step reasoning tasks, such as mathematical reasoning. To address this limitation, we introduce a novel algorithm called Step-level Value Preference Optimization (SVPO). Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning. Furthermore, from the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model, complementing standard preference optimization. This value model enables the LLM to generate…
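
As a rough illustration of the ideas named in the abstract, the sketch below shows the standard DPO loss applied to one preferred/dispreferred pair of reasoning steps, together with a generic pairwise ranking loss for an explicit value model. It is not taken from the paper or its repository: the function names, the logistic form of the ranking loss, the best-versus-worst pairing rule, and the example Q-values are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are log-probabilities under the policy being trained,
    ref_logp_* are log-probabilities under the frozen reference model.
    Here the pair is assumed to be two candidate steps that share a prefix.
    """
    implicit_reward_w = beta * (logp_w - ref_logp_w)
    implicit_reward_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)


def pairwise_value_loss(value_w, value_l):
    """Generic logistic ranking loss (an assumption, not the paper's exact form).

    It is small when the value model scores the preferred step higher.
    """
    return -log_sigmoid(value_w - value_l)


def select_step_pair(q_values):
    """Hypothetical pairing rule: best vs. worst sibling step by search Q-value.

    q_values maps a candidate step (string) to its estimated Q-value.
    Returns None when the siblings cannot be distinguished.
    """
    best = max(q_values, key=q_values.get)
    worst = min(q_values, key=q_values.get)
    if q_values[best] == q_values[worst]:
        return None
    return best, worst


# Made-up Q-values for three sibling steps at one node of a search tree.
siblings = {"step_a": 0.9, "step_b": -0.2, "step_c": 0.1}
pair = select_step_pair(siblings)
```

With identical policy and reference log-probabilities the DPO loss equals log 2, and it falls as the policy assigns relatively more probability to the preferred step.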

Code & Models

Repositories

MARIO-Math-Reasoning/Super_MARIO (official)

Models

MARIO-Math-Reasoning/SVPO_7B (model · 6 downloads · 4 likes)

Taxonomy

Topics: Constraint Satisfaction and Optimization · Advanced Algebra and Logic · Logic, Reasoning, and Knowledge