Step-level Value Preference Optimization for Mathematical Reasoning
Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan

TL;DR
This paper introduces SVPO, a method that improves mathematical reasoning in large language models by using Monte Carlo Tree Search to annotate step-level preferences and an explicit value model to complement preference optimization.
Contribution
The paper presents SVPO, which combines MCTS-based step-level preference annotation with an explicit value model trained from a learning-to-rank perspective, with the goal of improving multi-step reasoning in LLMs.
Findings
Achieves state-of-the-art results on mathematical reasoning benchmarks.
Effectively captures fine-grained step-level preferences.
Reduces inference costs with the explicit value model.
Abstract
Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference-aligned large language models (LLMs). However, the overall preference annotations of responses do not fully capture the fine-grained quality of model outputs in complex multi-step reasoning tasks, such as mathematical reasoning. To address this limitation, we introduce a novel algorithm called Step-level Value Preference Optimization (SVPO). Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning. Furthermore, from the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model, complementing standard preference optimization. This value model enables the LLM to generate…
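The ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the margin-based ranking loss, the default hyperparameters, and the rule of picking the best and worst sibling by MCTS Q-value are all assumptions made for illustration. Only the DPO loss follows the standard published formula.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss for one (preferred, dispreferred) pair.
    # Inputs are log-probabilities under the policy and the frozen reference.
    implicit_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * implicit_margin)

def value_ranking_loss(v_w, v_l, margin=0.5):
    # Hypothetical pairwise ranking loss for an explicit value model:
    # penalize the pair unless the preferred step scores higher by `margin`.
    return max(0.0, margin - (v_w - v_l))

def step_preference(children):
    # Hypothetical annotation rule: among sibling steps under one MCTS node,
    # given as (step_text, q_value) tuples, prefer the highest-Q step over
    # the lowest-Q step.
    ranked = sorted(children, key=lambda c: c[1], reverse=True)
    return ranked[0][0], ranked[-1][0]

# Usage: annotate one step-level pair, then score it with both losses.
siblings = [("x = 3", 0.9), ("x = 5", -0.5), ("x = 4", 0.2)]
preferred, dispreferred = step_preference(siblings)
pair_dpo = dpo_loss(-1.0, -2.0, -1.5, -1.5)
pair_rank = value_ranking_loss(0.9, -0.5)
```

In this sketch the preference pair is formed at the level of a single reasoning step rather than a whole response, which is the distinction the abstract draws between step-level and overall preference annotation.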
Taxonomy
Topics: Constraint Satisfaction and Optimization · Advanced Algebra and Logic · Logic, Reasoning, and Knowledge
