Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li

TL;DR
This paper introduces Step-Controlled DPO (SCDPO), a training method that improves mathematical reasoning in large language models by adding stepwise error supervision to DPO training. It consistently outperforms naive DPO across three SFT models.
Contribution
SCDPO automatically generates stepwise error signals by creating negative samples that begin to err at a specified reasoning step, which helps align LLMs to recognize reasoning errors and produce accurate steps.
Findings
SCDPO improves reasoning accuracy over naive DPO across multiple models.
Applying SCDPO yields high scores on GSM8K and MATH benchmarks.
SCDPO effectively identifies errors in mathematical solutions.
Abstract
Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of…
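The data construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_step_controlled_pair`, `sample_continuation`, and `is_correct` are hypothetical names, and how the erroneous continuation is actually sampled is an assumption left to the caller. The `dpo_loss` function is the standard per-example DPO objective computed on scalar sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # fully correct solution
    rejected: str    # shares the first `error_step` steps, then goes wrong
    error_step: int  # index of the first step allowed to deviate


def build_step_controlled_pair(
    question: str,
    correct_steps: List[str],
    error_step: int,
    sample_continuation: Callable[[str, List[str]], List[str]],  # hypothetical sampler
    is_correct: Callable[[List[str]], bool],                      # hypothetical answer checker
) -> Optional[PreferencePair]:
    """Keep the correct prefix, resample the rest, and keep the result
    only if it ends up wrong (assumed filtering rule)."""
    prefix = correct_steps[:error_step]
    rejected_steps = prefix + sample_continuation(question, prefix)
    if is_correct(rejected_steps):
        return None  # the continuation was still correct, so it is not a negative
    return PreferencePair(
        prompt=question,
        chosen="\n".join(correct_steps),
        rejected="\n".join(rejected_steps),
        error_step=error_step,
    )


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Because the rejected solution shares its first `error_step` steps with the chosen one, the preference signal in such a pair is concentrated on the steps after the shared prefix, which is one way to read the "stepwise error supervision" the abstract refers to.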
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Direct Preference Optimization · ALIGN · Shrink and Fine-Tune
