Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia

TL;DR
Step-DPO introduces a step-wise preference optimization method that improves the reasoning accuracy of large language models on mathematical tasks by focusing on individual reasoning steps rather than entire answers.
Contribution
The paper proposes a novel step-wise preference optimization approach and a data construction pipeline, significantly enhancing LLMs' mathematical reasoning accuracy with minimal data and training.
Findings
Step-DPO achieves a gain of nearly 3% in accuracy on MATH with 10K preference pairs.
Applied to Qwen2-72B-Instruct, Step-DPO outperforms GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro on MATH and GSM8K.
Self-generated data is more effective than human or GPT-4 data for DPO.
Abstract
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a…
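The step-wise objective described in the abstract can be illustrated with a minimal sketch. It assumes the standard DPO log-sigmoid loss is applied to a single preferred and a single dispreferred reasoning step that share the same problem and the same prefix of earlier steps. The function names, the argument names, and the default `beta` are illustrative assumptions, not taken from the authors' code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_dpo_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """DPO-style loss for one pair of candidate next steps.

    Each argument is the summed token log-probability of a single
    reasoning step, conditioned on the problem and on the shared
    prefix of earlier steps. Only the step itself is scored, not
    the whole answer.
    """
    # Log-ratio of policy to reference model for each candidate step.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    # Standard DPO loss on the difference of the two log-ratios.
    return -log_sigmoid(beta * (win_margin - lose_margin))


# Policy identical to the reference: loss equals log(2).
baseline = step_dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy already favours the preferred step: loss drops below log(2).
improved = step_dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In this sketch the only difference from answer-level DPO is what the log-probabilities cover: one step given a fixed prefix, rather than a complete solution.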
Code & Models
- 🤗 xinlai/Llama-3-70B-SFT-Step-DPO · model · 4 dl
- 🤗 xinlai/Qwen2-72B-Instruct-Step-DPO · model · 1 dl
- 🤗 xinlai/Qwen2-7B-SFT-Step-DPO · model · 3 dl
- 🤗 xinlai/DeepSeekMath-RL-Step-DPO · model · 8 dl · ♡ 2
- 🤗 xinlai/DeepSeekMath-Base-SFT-Step-DPO · model · 1 dl
- 🤗 xinlai/Qwen2-57B-A14B-SFT-Step-DPO · model · 2 dl · ♡ 1
- 🤗 xinlai/Qwen1.5-32B-SFT-Step-DPO · model · 2 dl · ♡ 1
- 🤗 RichardErkhov/xinlai_-_Qwen2-7B-SFT-Step-DPO-gguf · model · 30 dl
- 🤗 Zhihu-ai/Zhi-Create-DSR1-14B · model · 39 dl · ♡ 17
- 🤗 Zhihu-ai/Zhi-Create-DSR1-14B-GPTQ-INT4 · model · 6 dl · ♡ 12
Taxonomy
Topics: Business Process Modeling and Analysis · Semantic Web and Ontologies · Scheduling and Optimization Algorithms
Methods: Attention Is All You Need · Direct Preference Optimization · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam
