Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Zimu Lu; Aojun Zhou; Ke Wang; Houxing Ren; Weikang Shi; Junting Pan; Mingjie Zhan; Hongsheng Li

arXiv:2407.00782 · cs.CL · July 16, 2024

Open Access · 1 Repo

TL;DR

This paper introduces Step-Controlled DPO (SCDPO), a training method that improves mathematical reasoning in large language models by supplying stepwise error supervision during DPO training. It consistently outperforms naive DPO on three different SFT models.

Contribution

SCDPO automatically creates negative samples whose reasoning starts to go wrong at a specified step, then uses them in DPO training so the model learns to recognize reasoning errors and produce accurate reasoning steps.

Findings

01. SCDPO improves reasoning accuracy over naive DPO across multiple models.

02. Applying SCDPO yields high scores on GSM8K and MATH benchmarks.

03. SCDPO effectively identifies errors in mathematical solutions.

Abstract

Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of…
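
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a solution is a list of step strings and that an erroneous continuation is already available; how that continuation is generated is not covered here. The names `PreferencePair`, `build_step_controlled_pair`, and `dpo_loss` are hypothetical, and `beta=0.1` is an illustrative default. The loss itself is the standard DPO objective for a single preference pair, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Hypothetical container for one DPO training example."""
    prompt: str
    chosen: str      # fully correct solution
    rejected: str    # solution that goes wrong from `error_step` onward
    error_step: int  # index of the first erroneous step


def build_step_controlled_pair(
    prompt: str,
    correct_steps: List[str],
    erroneous_tail: List[str],
    error_step: int,
) -> PreferencePair:
    """Build a pair whose rejected sample shares the first `error_step`
    correct steps and then diverges (assumption: the tail is supplied)."""
    if not 0 <= error_step < len(correct_steps):
        raise ValueError("error_step must index a step of the correct solution")
    if not erroneous_tail:
        raise ValueError("erroneous_tail must contain at least one step")
    shared_prefix = list(correct_steps[:error_step])
    return PreferencePair(
        prompt=prompt,
        chosen="\n".join(correct_steps),
        rejected="\n".join(shared_prefix + list(erroneous_tail)),
        error_step=error_step,
    )


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the chosen and rejected samples share a correct prefix, the two sequences differ only from the error step onward. When the policy matches the reference on both samples, the margin is zero and the loss equals log 2; it falls below that as the policy shifts probability toward the chosen solution.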

Code & Models

Repositories

mathllm/Step-Controlled_DPO (PyTorch · Official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Direct Preference Optimization · ALIGN · Shrink and Fine-Tune