Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai; Zhuotao Tian; Yukang Chen; Senqiao Yang; Xiangru Peng; Jiaya Jia

arXiv:2406.18629·cs.LG·June 28, 2024·1 cites

Open Access · 1 Repo · 10 Models · 2 Datasets

TL;DR

Step-DPO introduces a step-wise preference optimization method that improves the reasoning accuracy of large language models on mathematical tasks by focusing on individual reasoning steps rather than entire answers.

Contribution

The paper proposes a novel step-wise preference optimization approach and a data construction pipeline, significantly enhancing LLMs' mathematical reasoning accuracy with minimal data and training.

Findings

01. Achieves a nearly 3% accuracy gain on MATH with as few as 10K preference pairs.

02. When applied to Qwen2-72B-Instruct, outperforms GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro on MATH and GSM8K.

03. For DPO, self-generated data is more effective than human-written or GPT-4-generated data.

Abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a…
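The idea of treating a single reasoning step as the unit of preference can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation: it assumes the usual DPO log-sigmoid objective, applied to one preferred and one rejected candidate for the next step, both conditioned on the same problem and the same prefix of earlier steps. The function names, the `beta` value, and the log-probabilities are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def step_dpo_loss(logp_win: float, logp_lose: float,
                  ref_logp_win: float, ref_logp_lose: float,
                  beta: float = 0.1) -> float:
    """DPO-style loss for one pair of candidate next steps.

    Each argument is the summed token log-probability of a single
    reasoning step (not a whole answer), given the problem and the
    shared prefix of earlier steps. The `ref_*` values come from the
    frozen reference model. `beta` is an assumed hyperparameter.
    """
    # Implicit reward of each step: policy log-prob relative to the reference.
    reward_win = logp_win - ref_logp_win
    reward_lose = logp_lose - ref_logp_lose
    margin = beta * (reward_win - reward_lose)
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, chosen only for illustration.
# Policy identical to the reference: zero margin, so the loss is log(2).
loss_equal = step_dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy favours the preferred step more than the reference does: lower loss.
loss_better = step_dpo_loss(-4.0, -7.0, -5.0, -5.0)
```

Under these assumptions, the only difference from answer-level DPO is what the log-probabilities are summed over: one step after a shared prefix, rather than the full response.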

Code & Models

Repositories

dvlab-research/step-dpo (PyTorch, official)

Taxonomy

Topics: Business Process Modeling and Analysis · Semantic Web and Ontologies · Scheduling and Optimization Algorithms

Methods: Attention Is All You Need · Direct Preference Optimization · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam